1. INTRODUCTION AND SUMMARY


The  original  objectives  of  this  program  were  to  explore  optical  methods 
for  performing  eigensystem  calculations  based  on  matrix-vector  or 
matrix-matrix  multiplications.  Specifically,  the  program  called  for  the 
analysis  and  design  of  a  high  accuracy  Acousto-Optic  (AO)  vector-matrix 
multiplier  together  with  pre-  and  post-processing  electronics  and 
input/output  interfaces  to  implement  an  eigensystem  solution  algorithm 
suitable  for  optical  implementation.  The  main  goal  of  the  program  was 
to explore high accuracy (≥ 16 bits) optical-based special purpose
systems  whose  performance  would  exceed,  by  orders  of  magnitude,  the 
current  or  even  projected  performance  of  electronic  systems  such  as 
CMOS,  VHSIC,  GaAs,  etc.  Such  systems  would  find  applications  in  the 
Adaptive  Phased  Array  Radar  (APAR)  area,  which  by  nature,  has  extremely 
high  computational  requirements,  of  the  order  of  10*®- 10*^  M-A/s 
(multiplications/additions  per  second) . 

In  the  first  phase  of  the  program,  available  eigensystem-solution 
algorithms  were  studied,  in  order  to  determine  their  suitability  for  AO 
implementation  (Section  2) .  The  results  of  this  study  showed  that  all 
algorithms,  aside  from  the  matrix  multiplication  part,  require  a 
plethora  of  operations  to  be  carried  out  electronically  rather  than 
optically.  This  is  because  optics  cannot  easily  or  practically  perform 
either  logical  operations  or  certain  arithmetic  operations  such  as 
square  roots  and  divisions.  Such  requirements  make  the  possibly 
efficient  use  of  the  AO  processor  highly  questionable.  In  addition, 
nearly  all  eigensystem  or  direct  APAR  algorithms  require  computational 
accuracies  that  exceed  16  binary  bits.  To  accommodate  such  accuracies 
requires  that  the  AO  processor  be  incorporated  in  a  system  involving 
non-analog  processing  techniques.  Unfortunately,  there  are  not  many 
algorithms suitable for implementing high accuracy multiplications/additions
with AO processors (Section 3). The only viable choice is the DMAC
algorithm (Digital Multiplication via Analog Convolution) and its
variations. Based on this algorithm, two novel AO architectures were
developed, a single-detector, space-integrating AO system and a
time-integrating, systolic AO system (Section 4). Utilizing
state-of-the-art technology it was estimated that both systems could in
principle deliver throughput rates of the order of 10^9 M-A/s. The
performance  of  these  systems  was  compared  with  that  of  state-of-the-art 
purely-electronic  counterparts  using  as  figures  of  merit  the  system 
efficiency,  defined  as  throughput  rate  per  unit  power,  and  the 
multiplication  speed  (Section  6) .  A  simple  analysis  showed  that  both 
systems  (as  well  as  other  DMAC  systems  that  have  appeared  in  the  open 
literature)  do  not  offer  any  advantage  over  electronic  systems  that 
could  be  assembled  with  existing  digital  multipliers.  The  reasons  for 
the  relatively  poor  performance  of  the  optical  systems  are:  (1)  the 
serial  nature  of  DMAC  and  (2)  the  fact  that  optics  does  only  part  of  the 
multiplication  while  the  non-optical  part  relies  on  power-consuming  A/D 
converters . 

In view of this situation, we developed a Bit Parallel Multiplication
technique (BPAM) for performing DMAC, which eliminates the serial
problem  (Section  3) .  Subsequently,  we  developed  a  novel  space-  and 
time-integrating  BPAM  AO  system  which  is  capable  of  performing 
multiplications  in  a  single  clock  cycle  (Section  4) .  However,  an 
analysis  showed  that  the  multiplication  speed  and  efficiency  are  only  of 
the  order  of  those  already  achieved  with  GaAs  multipliers.  The  reasons 
for  this  are:  (1)  the  nature  of  BPAM  which  incorporates  an  increased 
number  of  A/Ds  and  (2)  the  dimensions  of  the  focused  laser  beam  on  the 
AO  cell.  Thus,  we  concluded  that  neither  DMAC  nor  BPAM  AO  processors 
offer  any  significant  advantage  over  existing  electronic  processors. 
However,  in  conjunction  with  DMAC,  we  developed  a  circularly  polarizing 
sampling  technique  that  allows  for  complex  matrix  operations  with  much 
reduced time and hardware constraints (Section 5). We believe that this
technique can be applied to any systolic or array processor including
fully electronic systems.


These  initial  results  were  discussed  with  program  monitors  in  meetings 
held  in  early  1985.  As  a  result  of  these  meetings,  it  was  mutually 
decided to abandon all DMAC or BPAM AO related work and follow two new
directions for the program that were related to the APAR problem. These
two  directions  were:  (1)  optically  interconnected  electronic 
multipliers  and  (2)  position-coded  residue  optoelectronic  look-up  table 
(LUT)  processing. 

The  first  topic  involves  optical  interconnection  techniques  to  enable 
high-speed  multi-pin,  electronic  multipliers  to  be  arranged  in  patterns 
that  traditional  micro-strip  interconnects  cannot  handle  (Section  7) . 

For  this  purpose  simple  but  efficient  fiber-optic  splitter/combiner 
techniques  were  studied  and  prototypes  were  developed  in  conjunction 
with  a  4x4  bit  fiber-optically  addressed  digital  multiplier 
(Section  7.2).  These  results,  although  initial  and  largely  incomplete 
due  to  shortage  of  both  funds  and  time,  show  that  existing  low-cost 
fiber-optic  technology  can  be  used  for  globally  interconnecting 
electronic  array  processors.  We  suggest  that  these  ideas  merit  further 
development  via  the  design  and  implementation  of  a  fully  electronic 
addressed  square  array  processor. 

The  second  direction  involves  the  use  of  residue-based,  high-speed  LUTs 
that can be used for the APAR problem. Such LUTs allow for high-speed
(single  clock  cycle)  flexible  operations  such  as  multiplications, 
additions,  subtractions  and  some  forms  of  division.  We  approached  this 
idea  from  both  the  LUT  and  APAR-LUT  processor  level.  At  the  LUT  level 
we  suggested  a  novel  implementation  of  a  LUT  which  is  based  on  the  use 
of  interlaced  electrode  laser  diodes  or  light-emitting  diodes 
(Section  8) .  This  approach  offers  the  advantage  that  existing 
technology  can  be  used  for  the  implementation  of  GHz-type  operation 
LUTs. We demonstrated the capabilities of the laser diode LUT by
fabricating and testing a prototype. We show that data rates in excess
of 250 MHz (RZ data) can be achieved even for discrete component LUTs.
We believe that hybrid or monolithic approaches should offer switching
speeds to allow data rates in excess of 1 GHz. Operation in the residue
number system requires binary-to-residue and residue-to-binary
conversions.  With  this  in  mind,  we  showed  that  LUT  technology  can  be 
used  in  an  efficient  way  for  both  conversions.  Through  an  example  of  a 
typical  residue  LUT  system,  a  square  array  processor,  we  showed  that  the 
conversions  occupy  about  20%  of  the  hardware.  This  demonstrates  the 
need  to  remain  in  the  residue  domain  for  as  long  as  possible  so  that  the 
conversion-required  hardware  is  a  relatively  small  fraction  of  the 
total.  The  LUTs  require  a  large  number  of  laser  diodes  even  for 
moderate  size  applications  and  it  is  necessary  to  consider  ways  for 
hardware  minimization  both  at  the  LUT  and  system  levels.  We  examined 
LUT  implementation  techniques  in  which  the  number  of  laser  diodes  is 
reduced  by  about  50%.  However,  the  corresponding  number  of 
interconnections  required  per  laser  diode  is  increased  by  a  factor  of 
two  and  we  concluded  that  this  is  not  a  favorable  approach.  At  the 
system  level  we  studied  a  residue  scaling  technique  which  allows  for 
scaling  by  factors  of  about  0.005%  of  the  total  dynamic  range  while 
maintaining  accuracies  of  9-10  bits.  This  seems  to  be  a  viable 
technique  and  further  analysis  is  suggested  in  conjunction  with  specific 
applications. 

For  the  APAR  LUT  problem  we  examined  a  variety  of  non-iterative 
algorithms  for  possible  LUT  implementation  (Section  9) .  We  found  that 
the  only  choice  that  allows  for  residue  LUT  implementation  is  a  variant 
of  the  Gram-Schmidt  orthogonalization  approach.  In  particular,  this 
approach  does  not  require  square  roots  and  it  allows  for  the 
postponement  of  division  until  the  last  processing  step.  We  showed, 
through  examples,  that  this  technique  yields  results  identical  to  those 
obtained with straightforward arithmetic. We point out, however, that
this technique requires a rather large dynamic range which translates to
excessive  hardware.  Based  on  this  technique  we  designed  a  LUT-based 
pipelined processor which can invert a 6x6 APAR data matrix in about
7  clock  cycles.  This  processor  is  a  typical  example  of  the  flexibility 
afforded  by  the  residue  LUT  approach.  However,  we  emphasize  that  these 
are  initial  results  and  that  much  work  is  needed  for  further 
understanding  of  the  LUTs,  the  algorithms,  and  the  concept  as  a  whole. 

We suggest that analyses be carried out to clearly demonstrate the
competitiveness  of  the  residue  LUT  approach  compared  with  digital 
pipelined  techniques. 


2. EIGENSYSTEM SOLUTION

2.1 Introduction

In this section we consider the application of existing algorithms for
determining the eigenvectors \phi_n and eigenvalues \lambda_n of a
matrix C where

    C = \sum_{n=1}^{N} \lambda_n \phi_n \phi_n^{T}        (2.1)


One  application  for  the  eigenanalysis  of  a  matrix  is  in  a  method  for 
solving  the  Adaptive  Phased  Array  Radar  (APAR)  problem  illustrated  in 
Figure  2.1.  In  this  problem,  which  this  program  specifically 
addresses,  we  wish  to  calculate  the  adaptive  weight  vector  w  which 
satisfies  the  system  of  equations 

Cw  =  s  (2.2) 

where  C  is  the  data  covariance  matrix  formed  from  M  successive 
"snapshots"  of  the  data  vector  x(m),  i.e., 


    C = M^{-1} \sum_{m=1}^{M} x^{T}(m) x(m)        (2.3)

and s is the steering vector formed from the data vector x(m) and the
reference signal x_0, i.e.,

    s = M^{-1} \sum_{m=1}^{M} x^{T}(m) x_0(m)        (2.4)
The solution to Equation (2.2) is written formally as

    w = C^{-1} s        (2.5)

so that if the complete eigenvalues and eigenvectors of C are known,
then from Equation (2.1)

    w = \sum_{n=1}^{N} \lambda_n^{-1} \phi_n \phi_n^{T} s        (2.6)
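The step from the eigen-expansion of C to the expression for w can be spelled out as follows. This is a sketch that assumes the eigenvectors \phi_n are orthonormal and complete and that no eigenvalue \lambda_n is zero; these are the usual properties of a nonsingular real symmetric covariance matrix but are not stated explicitly in the text.

```latex
\begin{aligned}
C &= \sum_{n=1}^{N} \lambda_n \phi_n \phi_n^{T},
  \qquad \phi_m^{T}\phi_n = \delta_{mn} \\
C^{-1} &= \sum_{n=1}^{N} \lambda_n^{-1} \phi_n \phi_n^{T}
  \quad\text{since}\quad
  C\,C^{-1} = \sum_{n=1}^{N} \phi_n \phi_n^{T} = I \\
w &= C^{-1} s = \sum_{n=1}^{N} \lambda_n^{-1}\, \phi_n \left( \phi_n^{T} s \right)
\end{aligned}
```

Written this way, each term costs one inner product \phi_n^{T} s followed by a scaled vector addition.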

The flow chart in Figure 2.1 shows the successive steps to be taken in
determining the weight vector w which is subsequently used to derive the
antenna output signal y(m) given by

    y(m) = w^{T}(m) x(m)        (2.7)


To form the covariance matrix from the data vectors we require M outer
products of size NxN (where N is the number of elements in the antenna)
and M matrix additions. Once the covariance matrix is formed, a
complete eigenanalysis of the matrix is made to determine the
eigenvectors \phi_n and eigenvalues \lambda_n. The details of this
eigenanalysis depend on the particular algorithm used; the different
algorithms that can be used and which are particularly suited to optical
implementation form the subject matter of the remainder of this section.
Once the eigenvectors and eigenvalues are determined, the operations
required for forming the weight vector are N vector inner products and
N vector additions. One further vector inner product is required to
determine the output signal y(m).

Figure 2.1 Weight Determination for APAR by Complete Eigenanalysis of
the Data Covariance Matrix
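The operation counts above can be made concrete with a short Python sketch. It is illustrative only: it assumes real-valued data (radar data are generally complex), and the function names are hypothetical rather than taken from the report.

```python
def covariance_and_steering(snapshots, reference):
    """Form C (Eq. 2.3) and s (Eq. 2.4) from M real-valued snapshots.

    snapshots: list of M data vectors x(m), each a list of N floats.
    reference: list of M reference-signal samples x0(m).
    """
    M = len(snapshots)
    N = len(snapshots[0])
    C = [[0.0] * N for _ in range(N)]
    s = [0.0] * N
    for x, x0 in zip(snapshots, reference):
        for i in range(N):
            s[i] += x[i] * x0 / M              # steering-vector accumulation
            for j in range(N):
                C[i][j] += x[i] * x[j] / M     # one outer product per snapshot
    return C, s


def output_signal(w, x):
    """Antenna output y(m) = w^T x(m): a single inner product."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

The double loop over i and j is the NxN outer product; repeating it for M snapshots gives the M outer products and M matrix additions counted in the text.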


We note that all operations, leaving aside the eigenanalysis, are of the
type which can readily be implemented optically. Thus, the focus for
this method of solution of the APAR problem is on the algorithms
available for eigenanalysis and, in particular, on those algorithms that
are most suited to optical implementation.

2.2  Gershgorin  Method 
This method (References 1, 2) involves bounding the region of space in
which the eigenvalues are located. The flow diagram used for this method
is given in Figure 2.2. Thus, we first determine the radii \delta_i of
the discs which contain the eigenvalues, i.e.,

    \delta_i = \sum_{j=1, j \neq i}^{N} |C_{ij}|,    1 \le i \le N        (2.8)

The eigenvalues of C lie in the union of the discs of radii \delta_i
centered at C_{ii}. If m of these discs are connected and disjoint from
the remaining discs, these m discs then contain exactly m eigenvalues of
C. Determining this requires a logic operation.
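A minimal Python sketch of the disc computation and of the overlap test (the logic operation referred to above) is given here. The function names are hypothetical, and the radius is taken in the standard Gershgorin form, summing over the off-diagonal elements of each row.

```python
def gershgorin_discs(C):
    """Return (center, radius) for each row of the square matrix C.

    The radius of disc i is the sum of |C[i][j]| over j != i and the
    center is the diagonal element C[i][i].
    """
    n = len(C)
    return [(C[i][i], sum(abs(C[i][j]) for j in range(n) if j != i))
            for i in range(n)]


def discs_overlap(d1, d2):
    """Logic test: two discs intersect if their centers are no farther
    apart than the sum of their radii."""
    (c1, r1), (c2, r2) = d1, d2
    return abs(c1 - c2) <= r1 + r2
```

For the matrix [[4, 1, 0], [1, 10, 2], [0, 1, -5]] the three discs are mutually disjoint, so each contains exactly one eigenvalue.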

These bounds can be improved through repeated similarity
transformations. Thus, the first Gershgorin disc D_1 is reduced by
using D^{-1} C D instead of C, where

    D = \mathrm{Diag}(p, 1, \ldots, 1)        (2.9)

and \epsilon is the magnitude of the largest off-diagonal element. To
determine \epsilon requires a sorting operation in N(N-1) elements and p
requires a similar sorting operation in N(N-1) elements together with a
division. Finally, the similarity transformation requires two
matrix-matrix multiplications to derive D^{-1} C D.

Figure 2.2 Flow Chart for the Determination of Eigenvalue Regions
using the Gershgorin Method

The  advantages  of  this  method  are  that  it  is  simple  and  it  uses  matrix- 
matrix  multiplications,  operations  which  can  be  implemented  optically. 
However,  the  method  finds  only  eigenvalue  regions  and  logic  is  required 
for  isolating  the  eigenvalues,  an  operation  which  optics  cannot 
presently  address.  Further,  there  is  no  definite  order  in  which  the 
eigenvalues  can  be  found.  We  also  note  that  the  method  cannot  be  used 
to  find  eigenvectors.  One  procedure  for  finding  eigenvectors  from 
eigenvalues  is  that  of  inverse  iteration,  discussed  in  Section  2.4. 

Thus,  the  Gershgorin  method  may  be  useful  as  a  first  step  for  other 
methods  such  as  inverse  iteration. 

2.3  Power  Method 

This method (References 1, 2) is an iterative method and is illustrated
in Figure 2.3. We begin with some guess z^{(0)} for the dominant
eigenvector and form

    v^{(l+1)} = C z^{(l)}        (2.11)

in a matrix-vector multiplication operation. Next, the largest element of
v^{(l+1)} is determined, which involves a sorting operation in N elements
followed by N scalar multiplications, to derive
    z^{(l+1)} = \frac{v^{(l+1)}}{\| v^{(l+1)} \|}        (2.12)

Figure 2.3 Flow Chart for the Determination of Eigenvalues and
Eigenvectors using the Power Method

As l \to \infty, we obtain the eigenvector estimate, i.e.,

    z^{(l)} \to \phi_1


The  corresponding  eigenvalue  estimate  is  given  by 


(1*1) 


(.«)’  .»> 


(2.13) 


The disadvantages of this method are that it is an iterative method
requiring logic and division operations, neither of which can be readily
implemented using optics. The convergence rate for obtaining solutions
is given by

    \lambda_1^{(l+1)} = \lambda_1 \left( 1 + O\left[ (\lambda_2 / \lambda_1)^{l} \right] \right)        (2.14)


Various schemes for accelerating convergence have been proposed such as
using a shift of origin, i.e., using (C - pI) instead of C, using C^2
instead of C, and using the Rayleigh quotient
((z^{(l)})^T C z^{(l)}) / ((z^{(l)})^T z^{(l)}).


The  power  method  breaks  down  for  complex  eigenvalues  with  the  same 
modulus. 
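The iteration described in this section can be sketched in a few lines of Python. This is a hedged illustration: the function name is hypothetical, and using the normalizing element as the eigenvalue estimate is an assumption made here for brevity, not necessarily the estimate of Equation (2.13).

```python
def power_method(C, z0, iterations=50):
    """Estimate the dominant eigenpair of C by repeated matrix-vector products.

    Each pass forms v = C z (Eq. 2.11) and rescales v by its
    largest-magnitude element (Eq. 2.12). The scale factor itself is used
    here as the eigenvalue estimate, which is an illustrative choice.
    """
    n = len(C)
    z = list(z0)
    lam = 0.0
    for _ in range(iterations):
        v = [sum(C[i][j] * z[j] for j in range(n)) for i in range(n)]
        lam = max(v, key=abs)          # sorting/comparison step
        z = [vi / lam for vi in v]     # division step
    return lam, z
```

Only the line forming v is a matrix-vector product of the kind suited to an optical processor; the comparison and the division on the following two lines are the logic and division operations that the text identifies as the obstacle.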


2.4  Inverse  Iteration 

This method (References 1, 2) is a variation of the power method and is
best known for finding the eigenvector corresponding to an eigenvalue
close to some p. A flow chart for this method is given in Figure 2.4.
Thus, knowing an approximate eigenvalue p, we first form, through N
scalar additions, the matrix E given by:

    E = C - pI        (2.15)

We then seek solutions to the equation

    (C - pI) v^{(l+1)} = z^{(l)}        (2.16)

This is similar to the power method except that C is replaced by
(C - pI)^{-1}, both having the same eigenvectors.

The  next  step  involves  the  triangular  decomposition  of  E  to  give 


    E = L U        (2.17)


The  detailed  operations  involved  in  this  decomposition  are  shown  in  the 
flow  diagram  of  Figure  2.5,  and  involve  scalar  multiplications  and 
additions  together  with  logic  decisions. 
Figure 2.4 Flow Chart of the Determination of Eigenvectors
Corresponding to Given Eigenvalues by the Method of Inverse Iteration

[Figure 2.5: flow diagram of the triangular decomposition (scalar
multiplications and additions)]

Having decomposed the matrix E we solve the sets of equations

L  ,  = 


U  =  i  (2. 

by  back  substitution  to  derive  the  eigenvectors  and  eigenvalues. 
Operations  involved  here  are  sorting  in  N  elements  and  N  scalar 
multiplications . 

Disadvantages of the technique are that a considerable number of
transformations are required and the convergence properties are far from
satisfactory. In addition, matrices exist which have no triangular
decomposition in spite of the fact that their eigenproblem is
well-conditioned, or whose triangular decomposition is numerically
unstable.
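The sequence of steps in this section (shift, triangular decomposition, forward and back substitution, rescaling) can be sketched as below. The function names and the intermediate vector y are hypothetical. The factorization is done without pivoting, so it fails on a zero pivot, which is the weakness noted in the paragraph above.

```python
def lu_decompose(E):
    """Doolittle factorization E = L U without pivoting (unit diagonal in L)."""
    n = len(E)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [list(map(float, row)) for row in E]
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]      # fails if a zero pivot is met
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U


def lu_solve(L, U, b):
    """Solve L U v = b by forward then back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    v = [0.0] * n
    for i in reversed(range(n)):
        v[i] = (y[i] - sum(U[i][j] * v[j] for j in range(i + 1, n))) / U[i][i]
    return v


def inverse_iteration(C, p, z0, iterations=20):
    """Return an eigenvector estimate for the eigenvalue of C nearest p."""
    n = len(C)
    E = [[C[i][j] - (p if i == j else 0.0) for j in range(n)] for i in range(n)]
    L, U = lu_decompose(E)             # factor once, reuse every pass
    z = list(z0)
    for _ in range(iterations):
        v = lu_solve(L, U, z)
        scale = max(v, key=abs)        # largest-magnitude element
        z = [vi / scale for vi in v]
    return z
```

Almost every line here is a scalar multiplication, division or comparison rather than a matrix product, which illustrates why the method maps poorly onto an optical matrix-vector multiplier.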

2.5  Q-R  Method 


This method has proven to be the most effective of known methods of
solving the general eigenvalue problem. In contrast to the method
discussed in Section 2.4, it is based on unitary transformations. A
flow diagram of the approach is given in Figure 2.6. The first step
involves the Q-R decomposition of the matrix to give a factorization
into the product of a unitary matrix Q and an upper triangular matrix R

    C^{(l)} = Q^{(l)} R^{(l)}        (2.20)


The steps involved in this procedure are illustrated in Figure 2.7.
Operations involve (N-1) squares, (N-1) additions, 1 division and 1
multiplication for each pass through the loop together with 1 outer
product and matrix-matrix multiplication.


From  Equation  (2.20)  we  form 


    C^{(l+1)} = R^{(l)} Q^{(l)}        (2.21)


so that

    C^{(l+1)} = (Q^{(l)})^{T} (Q^{(l-1)})^{T} \cdots (Q^{(1)})^{T} C Q^{(1)} Q^{(2)} \cdots Q^{(l)}        (2.22)


As l \to \infty,

    C^{(l)} \to \mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_N)        (2.23)

i.e., for large l we have determined the eigenvalue matrix \Lambda

    C^{(l)} = \Lambda


Thus, 


    \Lambda = Q^{T} C Q        (2.24)

or rearranging

    C Q = Q \Lambda        (2.25)

where Q, the accumulated product Q^{(1)} Q^{(2)} \cdots Q^{(l)}, contains
the eigenvectors.


The  operations  involve  extensive  matrix-matrix  multiplication  together 
with  logic  decisions. 
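A compact Python sketch of the unshifted iteration follows. It is an illustration under stated assumptions: the factorization uses classical Gram-Schmidt for brevity, which is a substitution for whatever procedure Figure 2.7 describes, and the function names are hypothetical.

```python
import math


def qr_decompose(A):
    """Factor A = Q R with orthonormal columns in Q, by classical
    Gram-Schmidt (chosen for brevity; Householder or Givens steps are the
    usual numerically robust choices)."""
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j, a in enumerate(cols):
        u = list(a)
        for k, q in enumerate(q_cols):
            R[k][j] = sum(qi * ai for qi, ai in zip(q, a))
            u = [ui - R[k][j] * qi for ui, qi in zip(u, q)]
        R[j][j] = math.sqrt(sum(ui * ui for ui in u))   # square root step
        q_cols.append([ui / R[j][j] for ui in u])        # division step
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R


def matmul(A, B):
    """Plain matrix-matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def qr_eigenvalues(C, iterations=60):
    """Unshifted QR iteration: C <- R Q until C is nearly diagonal."""
    for _ in range(iterations):
        Q, R = qr_decompose(C)
        C = matmul(R, Q)
    return [C[i][i] for i in range(len(C))]
```

The call to matmul is the matrix-matrix product that optics could take over; the square root and division inside qr_decompose would remain electronic.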


2.6  Discussion 


Although  many  algorithms  exist  for  eigenanalysis  of  matrices,  they  all 
involve  extensive  arithmetic  and  logical  operations.  The  basic 
arithmetic  operations  of  addition,  subtraction,  and  multiplication  can 
be  implemented  using  optics.  However,  operations  such  as  division, 
square  roots,  and  logic  decisions  cannot  presently  be  implemented 
optically.  Thus,  the  algorithms  discussed  in  this  chapter  require,  at 
best,  a  hybrid  optical-electronic  approach  for  use  in  a  practical 
system. 

For  such  a  hybrid  system,  even  if  those  operations  which  can  be 
implemented  optically  prove  to  be  executable  at  higher  speed  than  can  be 
obtained  electronically,  the  overall  speed  is  still  dictated  by 
non-optical  operations.  If  continued  switching  between  the  optical  and 
electronic  domains  is  necessary  as  is  the  case  for  executing  these 
algorithms,  the  value  of  the  optical  processing  contributions  to  the 
overall  system  becomes  questionable. 
3. ALGORITHMS FOR HIGH-ACCURACY ACOUSTO-OPTIC PROCESSORS


Optical systems performing multiplication and addition can be
implemented via analog, binary, and residue techniques. Analog
processors,  although  fast,  suffer  from  low  accuracy  and  can  only  be  used 
for  applications  where  8-10  bits  accuracy  is  required.  Due  to  the 
nature  of  our  applications  (eigensystem  solution,  APAR,  etc.)  we  require 
accuracies  well  in  excess  of  10  bits  (e.g.,  16  or  20  bits)  and,  thus,  we 
will  not  consider  analog  processors.  This  Section  covers  binary 
techniques  which  result  in  AO  systems  of  high  digital  accuracy.  Residue 
arithmetic  methods  will  be  discussed  in  Section  8. 

3.1 Digital Multiplication via Analog Convolution (DMAC)

Multiplication of two binary numbers via analog convolution (DMAC)
(Reference 3) is based on the novel idea of convolving the binary words
representing the two numbers. The result is generated in a mixed binary
format where, like binary arithmetic, each digit is weighted by a power
of 2; but unlike binary arithmetic, each digit can be > 1. The algorithm
can be best understood via some simple examples: consider the
calculation of the products 15*41 and 29*62. We first convolve the
binary representations of the numbers for each product. The results of
the convolutions are

(15*41): [0 0 1 1 1 1]*[1 0 1 0 0 1] = [0 0 1 1 2 2 1 2 1 1 1]  (3.1)

(29*62): [0 1 1 1 0 1]*[1 1 1 1 1 0] = [0 1 2 3 3 4 3 2 1 1 0]  (3.2)

Next, we weight each convolution point by a power of 2. Finally we sum
the weighted points to obtain the final result:


15*41 = [1*2^8 + 1*2^7 + 2*2^6 + ... + 1*2^0] = 615  (3.3)

29*62 = [1*2^9 + 2*2^8 + 3*2^7 + ... + 0*2^0] = 1798  (3.4)


Note that in order to sum different products (which is the case for inner
products), we sum the corresponding points and subsequently weight and
sum. For example:

(15*41) + (29*62) = [(1+0)*2^9 + (2+1)*2^8 + (3+1)*2^7 + ... +
(0+1)*2^0] = 2413  (3.5)
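The worked examples above can be reproduced with the following Python sketch. The helper names are hypothetical, and the convolution, which the optical processor would form in analog, is written here as an explicit double loop.

```python
def to_bits(value, width):
    """Unsigned binary digits of value, most significant first."""
    return [(value >> k) & 1 for k in reversed(range(width))]


def convolve(a, b):
    """Discrete convolution of two digit sequences (no carries)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def mixed_binary_value(digits):
    """Weight each mixed binary digit by its power of 2 and sum."""
    top = len(digits) - 1
    return sum(d << (top - k) for k, d in enumerate(digits))


def dmac_inner_product(pairs, width):
    """Sum the convolutions of several products point by point, then weight."""
    total = [0] * (2 * width - 1)
    for x, y in pairs:
        c = convolve(to_bits(x, width), to_bits(y, width))
        total = [t + ci for t, ci in zip(total, c)]
    return total, mixed_binary_value(total)
```

Calling dmac_inner_product([(15, 41), (29, 62)], 6) accumulates the two convolutions point by point before any weighting, so no carries are propagated until the final weighted sum.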

The importance of this scheme is evident in considerations of dynamic
range requirements. For example, to multiply two numbers each with a
dynamic range of N = 2^16 = 65,536, we need an output dynamic range of
N^2 = 2^32 = 4.3 x 10^9. With binary encoding, the input and output
dynamic range must be 2 (i.e., "0" and "1") and 16 (i.e., when all 16
bits are "1"), respectively. Consider now the summation of 50 such
products. If analog techniques were used, we would need an output
dynamic range of 2.1 x 10^11. With the binary scheme, we need input and
output dynamic ranges of 2 and 50 x 16 = 800.

Notice that once the convolution data have been generated (in analog
form), an A/D converter in conjunction with a shift-register/
accumulator can be used to convert the mixed binary data to binary data.
The A/D converter requirement is for log_2 k bits, where k is the maximum
value of the mixed binary data.

The DMAC technique can be extended, via a twos complement encoding, to
handle both positive and negative numbers. To allow sign notation, the
leftmost bit for each binary word is the sign bit: 0 for plus and 1 for
minus. Positive binary numbers are represented by their original binary
form with the addition of the sign bit. For example, the integer +13 is
represented by 0 1 1 0 1.

To represent a negative number we first change the sign bit of its
signed binary absolute value from 0 to 1. Next, we change all the ones
to zeros and all zeros to ones in the remaining (magnitude) bits (this
is the ones complement representation). Finally, we add a 1. For
example, for the integer -45, the ones complement representation is
1 0 1 0 0 1 0, which in twos complement form becomes 1 0 1 0 0 1 1.
Conversion from the twos complement representation to signed absolute
value is obtained by changing all zeros (ones) to ones (zeros) and
adding a 1.
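The encoding recipe can be checked with a small Python sketch. The function names are hypothetical, and the word width (sign bit included) is passed explicitly.

```python
def twos_complement_bits(value, width):
    """Encode a signed integer as `width` bits, sign bit first, following the
    recipe in the text: magnitude, complement the magnitude bits, add 1."""
    magnitude_width = width - 1
    mask = (1 << magnitude_width) - 1
    if value >= 0:
        bits = value
    else:
        ones_complement = (~(-value)) & mask          # flip magnitude bits
        bits = (1 << magnitude_width) | ((ones_complement + 1) & mask)
    return [(bits >> k) & 1 for k in reversed(range(width))]


def from_twos_complement(bits):
    """Decode a sign-bit-first twos complement word to a signed integer."""
    value = int("".join(map(str, bits)), 2)
    return value - (1 << len(bits)) if bits[0] else value
```

With a 5-bit word +13 encodes as 0 1 1 0 1, and with a 7-bit word -45 encodes as 1 0 1 0 0 1 1, matching the examples in the text.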

As is shown in Reference 4, one technique of multiplying two numbers,
using twos complement binary representation, requires that the input
numbers be represented by the same number of bits required to represent
the output. For example, consider the product +13 x -45 = -585. To
represent the output, we require a total of 11 bits, including the sign
bit. To extend the input numbers to 11 bits, we insert six zeros
between the sign bit and the most significant bit (MSB) of +13 and four
ones between the sign bit and the MSB of -45. Thus the input numbers
become 0 0 0 0 0 0 0 1 1 0 1 and 1 1 1 1 1 0 1 0 0 1 1 for +13 and -45
respectively. The product of the binary numbers can now be calculated
by performing a usual-sense multiplication with the exception that any
bits generated to the left of the sign bit column are truncated. An
example of this procedure is shown in Figure 3.1 for the case of the
product +13 x -45. The result is expressed in a mixed binary form and
it can be converted to twos complement representation as follows: divide
the least significant mixed binary digit by 2, add the quotient to the
next digit, divide by 2, add the quotient to the next digit, and so on.

The  remainders  of  these  series  of  operations  constitute  a  binary  word 
which  is  the  twos  complement  representation  of  the  mixed  binary  output. 
Note  that  the  remainder  of  the  first  division  is  the  LSB  and  the 
remainder  of  the  last  division  is  the  sign  bit  of  the  so-obtained  twos 
complement  binary  word.  An  example  of  this  procedure  is  shown  in 
Figure  3.2  for  the  case  of  the  product  +13  x  -45  =  -585. 
As Figure 3.2 suggests, the result is 1 0 1 1 0 1 1 0 1 1 1 which is the
twos complement representation of the number -585. Note that if the
mixed binary result is generated in the form of an analog signal, then
the device necessary for the conversion is an A/D converter followed by
a shift register/accumulator.

As mentioned earlier, the mixed binary form of the output allows
addition of different products without the need for carries. To
illustrate this, consider the summation of the products (+13 x -45) and
(+13 x -10). The mixed binary representation of the product
+13 x -10 = -130 is 3 3 3 3 2 2 3 1 1 1 0. Addition of this result to
the one that corresponds to the product +13 x -45 yields

      3 3 2 2 2 0 2 2 1 1 1
    + 3 3 3 3 2 2 3 1 1 1 0
    -----------------------
      6 6 5 5 4 2 5 3 2 2 1

Conversion of the mixed binary result to twos complement gives
1 0 1 0 0 1 1 0 1 0 1 which is the twos complement representation of the
number (-585) + (-130) = -715.
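The signed examples of this section can be verified with the sketch below. It is illustrative only: the function names are hypothetical, and the truncation of columns to the left of the sign bit is modelled by keeping only the 11 least significant convolution points.

```python
def sign_extend(value, width):
    """Twos complement bits of a signed integer, most significant first."""
    return [((value % (1 << width)) >> k) & 1 for k in reversed(range(width))]


def truncated_convolution(a, b):
    """Convolve two equal-length words, keeping only the `width` least
    significant points (columns left of the sign bit are discarded)."""
    width = len(a)
    ra, rb = a[::-1], b[::-1]              # index 0 = least significant
    out = [sum(ra[i] * rb[k - i] for i in range(k + 1)) for k in range(width)]
    return out[::-1]


def mixed_to_twos_complement(digits):
    """Divide by 2 from the LSB upward, passing each quotient to the next
    digit; the remainders form the twos complement word."""
    carry, bits = 0, []
    for d in reversed(digits):
        carry, remainder = divmod(d + carry, 2)
        bits.append(remainder)
    return bits[::-1]                      # final carry is truncated


def signed_value(bits):
    """Interpret a sign-bit-first twos complement word as an integer."""
    raw = int("".join(map(str, bits)), 2)
    return raw - (1 << len(bits)) if bits[0] else raw
```

The divmod loop plays the role of the A/D converter followed by the shift register/accumulator: each remainder is an output bit and each quotient is the carry shifted into the next digit.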

From the above brief discussion it is apparent that incorporation of
the DMAC-twos complement arithmetic by optical processors solves two
major problems; specifically, accuracy and bipolar number handling.
However, note that the use of such algorithms results in a major
sacrifice in the processor's time-bandwidth product (TBW),
specifically, a reduction by at least a factor of 2N-1, where N is the
number of bits in the input. This is due to the nature of the
serial-type convolution which requires 2N-1 clock cycles for its
completion. Note that aside from the TBW reduction, we undergo a
multiplication speed reduction (as compared to the clock).
Specifically, the digital multiplication time required is T_c x (2N-1)
where T_c is the clock pulse-width. These issues are discussed in more
detail in Section 6.

3.2 Bit Parallel Multiplication (BPAM)

To avoid the DMAC problems, we have developed a bit-parallel digital
multiplication (BPAM) algorithm (Reference 3). This can be explained via
a simple convolution example. Suppose we wish to convolve the sequences
A_3 A_2 A_1 A_0 and B_3 B_2 B_1 B_0 where A_i, B_i, i = 0,1,2,3 are
digits of binary value. Since N=4, we should obtain 2N-1 = 7
convolution points. A rigorous implementation of the convolution shows
that the 7 convolution points are

    P^0 = A_0 B_0
    P^1 = A_1 B_0 + A_0 B_1
    P^2 = A_2 B_0 + A_1 B_1 + A_0 B_2
    P^3 = A_3 B_0 + A_2 B_1 + A_1 B_2 + A_0 B_3        (3.6)
    P^4 = A_3 B_1 + A_2 B_2 + A_1 B_3
    P^5 = A_3 B_2 + A_2 B_3
    P^6 = A_3 B_3

From Eq. (3.6) we can observe the following:

(1) The output convolution points (P^0 through P^6) are linear
combinations of various A_i B_j products (e.g., P^1 = A_1 B_0 + A_0 B_1).

(2) If all A_i B_j products are available in parallel, one can form the
output convolution points by summing properly.

(3) If the products and the various product summations can occur in
parallel, then the time required for digital multiplication is no
longer 2N-1 clock cycles but rather 1 clock cycle.
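The three observations can be illustrated with a short Python sketch. The function names are hypothetical, bit lists are taken least significant first, and the code necessarily runs sequentially; in the scheme described above every entry of the partial-product table and every sum would be formed simultaneously.

```python
def bpam_points(a_bits, b_bits):
    """Form all A_i B_j partial products at once, then sum those with
    i + j = k to obtain convolution point P^k (index 0 = least significant)."""
    n = len(a_bits)
    products = [[a_bits[i] & b_bits[j] for j in range(n)] for i in range(n)]
    return [sum(products[i][k - i] for i in range(n) if 0 <= k - i < n)
            for k in range(2 * n - 1)]


def points_to_int(points):
    """Weight point P^k by 2**k and sum."""
    return sum(p << k for k, p in enumerate(points))
```

For A = 13 and B = 11 (bit lists [1, 0, 1, 1] and [1, 1, 0, 1]) the seven points are [1, 1, 1, 3, 1, 1, 1], which weight and sum to 143. The NxN table of partial products is the price paid for the single-cycle result.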


From the above we see that, in principle, a BPAM can be formed in a
single clock cycle as long as all the A_i B_j input bit combinations are
available in parallel. Note that the summation of M different number
products (i.e., inner product) can be achieved in a way similar to the
one for DMAC; i.e., sum in parallel all the A_{im} B_{jm},
i,j = 1,2,...,N, m = 1,2,...,2N-1 convolution points.

The  BPAM  approach  solves  at  least  two  major  problems  (as  compared  with 
DMAC);  namely,  (1)  TBW  reduction  and  (2)  net  multiplication  speed. 
However, it creates a problem which is absent in DMAC; namely, it
requires  NxN  output  points  for  a  single  multiplication,  which  translates 
to  a  high  output  resolution  requirement.  Nevertheless,  it  is  the  only 
binary  technique  that  guarantees  both  high  speed  and  accuracy. 


4. OPTICAL ARCHITECTURES FOR DMAC AND BPAM


In this Section we discuss a number of possible optical architectures
which we have developed, in conjunction with the algorithms of
Section 3. The first family of processors (space-integrating
Acousto-Optic processor and time-integrating Acousto-Optic processor)
represents systems that are based on the serial-type convolution. The
performance of these processors is typical of that expected from systems
which utilize serial DMAC. These processors can be fabricated using
present custom technology. The second family of processors (BPAM
Acousto-Optic processor) represents systems that are based on
bit-parallel digital multiplication (BPAM). Unlike the first class of
systems (which require 2N-1 clock cycles for the formation of the
convolution), the multiplication is formed in a single clock cycle,
thereby greatly enhancing the net multiplication speed. These
architectures are good examples of BPAM optical processors that can be
fabricated with present technology. Furthermore, these architectures
should serve as a guide to the performance that may be expected from a
BPAM processor.

4.1  DMAC  Acousto-Optic  Space-Integrating  Processor 

A simple binary number multiplication can be achieved using
Acousto-Optic (AO) techniques and the serial convolution scheme of
Section 3.1. Suppose we want to multiply two numbers, A and B, each
represented by N bits. The bits of both numbers are made available
serially (as a function of time) and named S_A(t) and S_B(t),
respectively.


The optical system shown in Figure 4.1 is composed of two AO cells, AO1
and AO2, arranged in a counter-propagation configuration (i.e., the
sound waves travel in opposite directions). The first-order diffracted
beam from AO1 is imaged on AO2 through a pair of lenses and a spatial
filter (not shown). The product of the resulting distributions is imaged
onto a cylindrical lens through a second pair of lenses and a spatial
filter. A detector follows a slit, which is placed at the back focal
plane of the cylindrical lens, so that only the DC part of the resulting
Fourier Transform is detected. Thus, the lens-slit-detector part of the
system is used to form the integration of the product of the data
present in AO1 and AO2. The data S_A(t) and S_B(t) are applied
simultaneously onto the AO cells. At every instant of time, the
cylindrical lens forms the summation over the bit-by-bit products
S_A·S_B. The resulting light is detected by the detector. Because the
data S_A(t) and S_B(t) are moving in opposite directions, the light
incident on the detector is proportional to the convolution
S_A(t)*S_B(t). Consequently, the output of the detector is proportional
to the convolution values. Since the data S_A and S_B are each composed
of N bits, the convolution is composed of 2N-1 parts, each triangular in
shape, under the assumption that the bits are represented by square
pulses. Note that the maximum value of the convolution occurs when all
S_A and S_B data are present in the AO cells. This corresponds to the
highest triangle of Figure 4.1. Thus, we see that the simple arrangement
of Figure 4.1 performs the first step of the binary algorithm; namely,
the convolution. To obtain the product A·B, we have to weight each
convolution point by a power of 2 and then sum the results. This can be
achieved if an A/D converter follows the detector and feeds its output
to a digital shift-register/accumulator. When all 2N-1 convolution parts
have been accumulated, the values of the shift register are read out.
This binary word corresponds to the product A·B with an accuracy of 2N
bits (each number is represented by N bits in the input).

Figure 4.1 Acousto-Optic Processor for Binary Number Multiplication
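
The serial scheme just described can be mimicked in a few lines of Python. This is a minimal sketch, assuming unsigned N-bit operands and ideal detection; the function name `dmac_serial_multiply` is hypothetical. Each pass through the loop stands for one clock cycle: the detector reading is the sum of the overlapping bit products, and the shift-register/accumulator weights that reading by the appropriate power of 2.

```python
def dmac_serial_multiply(a, b, n):
    """Multiply two unsigned n-bit numbers by serial convolution."""
    a_bits = [(a >> k) & 1 for k in range(n)]   # LSB first
    b_bits = [(b >> k) & 1 for k in range(n)]
    samples = []          # detector readings, one per clock cycle
    accumulator = 0       # digital shift-register/accumulator
    for t in range(2 * n - 1):                  # 2N-1 clock cycles
        # Space integration: sum of bit products in the current overlap.
        detector = sum(a_bits[i] * b_bits[t - i]
                       for i in range(n) if 0 <= t - i < n)
        samples.append(detector)
        accumulator += detector << t            # weight by 2**t
    return samples, accumulator

samples, result = dmac_serial_multiply(13, 11, 4)
print(samples, result)
```

The detector readings never exceed N, which is why a modest analog dynamic range is sufficient for a single product.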

We can now expand the binary number multiplier system of Figure 4.1 to a
multi-channel system for vector-matrix multiplication. Suppose we want
to multiply the vector b_11, b_21, ..., b_M1 with the matrix A consisting
of elements a_ij, where i = 1,2,...,M and j = 1,2,...,M. The result will
be M inner products C_i1, where i = 1,2,...,M.

Figure 4.2 Space-Integrating Acousto-Optic Processor for Accurate
Vector-Matrix Multiplication
The optical system shown in Figure 4.2 is similar to the optical system
of Figure 4.1, except that: (1) AO1 and AO2 are M-channel cells and (2)
the lens in front of the detector is spherical rather than cylindrical.
If a cylindrical lens is substituted in place of the spherical lens,
then the system is an exact multi-channel version of the system of
Figure 4.1. Thus, M detectors placed in parallel at the focal plane and
at locations x = 0, y = y_1, y = y_2, ..., y = y_M (where y_1, y_2, ...,
y_M are the locations of the M AO cell channels across y), record M
parallel convolutions:

$$ a_{11} * b_{11},\quad a_{12} * b_{21},\quad \ldots,\quad a_{1M} * b_{M1} \qquad (4.1) $$
If these convolutions are added and weighted, the result corresponds
to a single inner product, C_11. This is exactly the function of the
power of the spherical lens along y. Thus, the lens accomplishes two
tasks: (1) it performs the necessary convolution integral along x and
(2) it sums the various convolution points along y, similar to the
operation shown in the parentheses of Eq. (3.5).

This operation is allowed because of the mixed-binary format of the
output (resulting from the binary multiplication scheme), which allows
for product summation without the need for carries. Thus, the output
of the detector corresponds to a convolution which, after the required
post-processing, is equal to the inner product:

$$ C_{11} = [a_{11}b_{11} + a_{12}b_{21} + \cdots + a_{1M}b_{M1}] \qquad (4.2) $$
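
The carry-free summation can be checked numerically. The sketch below is an illustration only, assuming unsigned N-bit elements; the names `conv_points` and `inner_product_points` are hypothetical. It adds the convolution points of the M channel products element by element (the role played by the lens along y) and applies the power-of-2 weighting once, at the end.

```python
def conv_points(a, b, n):
    """Convolution points of two unsigned n-bit numbers (LSB-first bits)."""
    a_bits = [(a >> k) & 1 for k in range(n)]
    b_bits = [(b >> k) & 1 for k in range(n)]
    return [sum(a_bits[i] * b_bits[k - i]
                for i in range(n) if 0 <= k - i < n)
            for k in range(2 * n - 1)]

def inner_product_points(a_vec, b_vec, n):
    """Add the mixed-binary outputs of all channels without carries."""
    total = [0] * (2 * n - 1)
    for a, b in zip(a_vec, b_vec):
        for k, p in enumerate(conv_points(a, b, n)):
            total[k] += p
    return total

a_row = [3, 5, 7]        # hypothetical 3-bit matrix row
b_col = [2, 4, 6]        # hypothetical 3-bit vector
total = inner_product_points(a_row, b_col, 3)
value = sum(p << k for k, p in enumerate(total))   # single post-processing step
print(total, value)
```

Because the weighting is linear, summing before weighting gives the same result as weighting each product separately and then adding.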


To obtain the second inner product C_21, the vector b_11, b_21, ..., b_M1
is loaded onto AO2 while the vector a_21, a_22, ..., a_2M is loaded onto
AO1. This procedure is repeated M times until all C_i1, i = 1, 2, ..., M
inner products (that correspond to the vector-matrix product A·B) are
obtained. Similarly, in order to obtain a full matrix-matrix
multiplication, this procedure is repeated M x M times.
We see that the system of Figure 4.2 is a real-time high-accuracy
vector-vector  multiplier  which  can  be  used  for  either  vector-matrix  or 
matrix-matrix  multiplication.  From  the  systems’  point  of  view,  the 
above  multiplier  offers  another  very  significant  advantage,  namely,  a 
single  output,  since  a  single  detector  is  used  for  inner  product 
detection.  Consequently,  the  required  interface  with  a  digital 
microprocessor  is  very  simple. 

It is worthwhile to mention that if M detectors are used (in conjunction
with a cylindrical lens) the system can still calculate inner
products. In this case the necessary product summation must be carried
out  digitally.  This  obviously  increases  the  complexity  of  the 
electronic  post-processing  as  well  as  of  the  interface,  but  it  offers 
some  additional  flexibility.  Specifically,  it  allows  for  separate 
operations  over  the  various  input/output  channels,  just  like  a 
conventional  digital  array  processor.  Whether  one  wants  to  use  a 
processor  with  a  single  detector  or  M  detectors  is  a  question  that 
depends  on  the  specific  algorithms  used  and  can  be  answered  only  when  a 
specific  analysis  of  existing  algorithms  is  made  in  conjunction  with  the 
architectural  choices. 

4.2  DMAC  Acousto-Optic  Space-Integrating  Processor  Characteristics 

The binary algorithm allows the processor to have input and output
dynamic ranges of N and 2N bits, respectively. The component dynamic
range requirements are 2:1 for the AO cells and NxM:1 for the detector
in a single-detector system or N:1 in a multi-detector system. This is
because, for full inner product formation, the maximum possible value of
the output convolution is N x M, which occurs when all M input numbers
have all their input bits at logic "1". Consider now a specific example
of M = 128 and N = 8 (16 bits output). Then NxM = 1024, which means
that for a single-detector system, the dynamic range of the detector
needs to be at least 1024:1. In practice, however, the dynamic range
of the detector should be higher in order to avoid detection errors
(which might be severe if one takes into account the post-processing
stage where each convolution point is weighted by a power of 2). A
simple statistical analysis (Reference 6) shows that, in order to keep
the bit error probability to < 2.9 x 10^-7, the dynamic range of the
detector needs to be increased by a factor of 10, which corresponds to
10,000:1. This requires a detector with 40 dB dynamic range, which is
commercially available. On the other hand, if a 128-detector system is
to be used, the maximum value of the convolution is N, which corresponds
to a detector dynamic range of 10 x 8:1 or 19 dB. It is evident that the
detector requirements are not severe and can be met with commercially
available devices.
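
The detector figures quoted above follow from simple arithmetic, sketched here in Python. Two assumptions are made explicit: the decibel values are taken as 10 log10 of the ratio (the convention that reproduces the 40 dB and 19 dB figures in the text), and the function name `detector_dynamic_range` is hypothetical.

```python
import math

def detector_dynamic_range(n_bits, m_channels, margin=10, single_detector=True):
    """Return (ratio, dB) for the detector dynamic range requirement.

    The peak convolution value is N*M for one detector summing all
    channels, or N when each channel has its own detector. The margin
    of 10 is the error-probability factor quoted in the text.
    """
    peak = n_bits * m_channels if single_detector else n_bits
    ratio = margin * peak
    return ratio, 10.0 * math.log10(ratio)   # assumed 10*log10 convention

single = detector_dynamic_range(8, 128, single_detector=True)
multi = detector_dynamic_range(8, 128, single_detector=False)
print(single, multi)
```

For N = 8 and M = 128 this gives a ratio of 10,240:1 (about 40 dB) for the single-detector case and 80:1 (about 19 dB) for the 128-detector case.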

To  avoid  computational  errors,  both  AO  cells  should  be  very  uniform  over 
their  entire  apertures.  This  requirement  is  significant,  especially  for 
the  single  detector  system,  and  comes  about  because  of  the  2-D  spatial 
integration  used.  If  non-uniform  devices  are  used,  convolution  points 
of  the  same  analog  values  will  correspond  to  different  light  levels,  and 
spatial  integration  will  consequently  yield  an  incorrect  output. 

Initial analysis shows that the uniformity required should be better
than N x M:1 and tends to approach the dynamic range of the
detector (i.e., 10 x N x M:1). The uniformity requirement of the AO
cell is achievable because the required AO cell time-aperture is
relatively small (i.e., for N = 8 and bit-width of 10 nsec, the required
aperture is 0.08 μsec). The small apertures associated with low
acoustic attenuation AO crystals result in a very uniform acoustic
field, which corresponds to the propagating bit-stream. On the other
hand, because of the small aperture, acoustic diffraction is
controllable. Note that the effects of acoustic diffraction, in
conjunction with the algorithm used, can be severe (this is explained in
detail in Reference 7).
The A/D converter requirement is log2(N x M) bits for a single-detector
system or log2(N) bits for an M-detector system. For example: for
N = 8 and M = 128, log2(NxM) = 10 and log2(N) = 3. These A/D
requirements are easily met with commercially available devices.

The throughput rate of the system depends on several factors: (1)
number of input channels, (2) number of output channels, (3) number of
input bits, (4) input bit width and (5) number and speed of available
A/D converters. Let us calculate the throughput rate of the system
based on rather optimistic data. We assume the availability of a Bragg
cell with 128 input channels and use 128 output channels (i.e., the most
flexible version of Figure 4.2, where the lens is cylindrical).

For N = 8 and a bit width T_B = 100 nsec, the total time required for
formation of a single inner product is

$$ T = (2N-1)\,T_B = 1.5\ \mu\text{sec} \qquad (4.3) $$

where the extra (N-1)T_B time represents the total duration of zeros
which follow the N bits. This is required in order to separate the
different inner products. During this time, the system (with M = 128)
has performed 128 multiplications. Thus, the throughput rate of the
system is

$$ R = \frac{128\ \text{M-A}}{1.5 \times 10^{-6}\ \text{sec}} = 8.5 \times 10^{7}\ \text{M-A/sec} \qquad (4.4) $$

To improve the throughput rate of the system, we need to decrease T,
which implies that we need to decrease the input bit width. For example,
for T_B = 3 nsec, the throughput rate of the system is R =
2.84 x 10^9 M-A/sec or 2.8 GOPS.

For this scenario with T_B = 3 nsec, the output A/D requirements are:
(1) 128 3-bit A/D's with a speed of 300 MHz (for the multichannel output
version) and (2) a single 10-bit A/D with a speed of 300 MHz (for the
single output version). Clearly the single-channel output version is
impractical since 10-bit A/D's at 300 MHz are not presently available.
On the other hand, the multi-channel output version, although difficult,
is more realistic.
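
The throughput figures can be reproduced with the following Python sketch. It assumes, as the text does, that one inner product of M multiplications is completed every (2N-1) bit widths; the function name `dmac_throughput` is hypothetical.

```python
def dmac_throughput(m_channels, n_bits, bit_width_s):
    """Multiply-adds per second for the space-integrating DMAC processor."""
    t_inner = (2 * n_bits - 1) * bit_width_s   # time per inner product
    return m_channels / t_inner

slow = dmac_throughput(128, 8, 100e-9)   # bit width of 100 nsec
fast = dmac_throughput(128, 8, 3e-9)     # bit width of 3 nsec
print(f"{slow:.2e} M-A/sec, {fast:.2e} M-A/sec")
```

With 128 channels and 8-bit inputs, these two bit widths give roughly 8.5 x 10^7 and 2.84 x 10^9 M-A/sec, respectively.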


4.3 DMAC Acousto-Optic Time-Integrating Processor


The basic unit of this processor is the classical time-integrating AO
processor whose schematic diagram is shown in Figure 4.3. We first
describe the operation of the unit for the formation of a single product
(e.g., +13 x -45) via the twos complement scheme. The AO cell is
driven by the binary data a = +13 in a bit-serial mode (Figure 4.4).
The sign bit is applied first. At time t = t_1 all bits that correspond
to the number +13 have been loaded into the cell. The so-created
spatial distribution is Schlieren imaged onto a time-integrating linear
detector array. The array consists of N elements, where N is the number
of output bits (e.g., for our example N = 11). At time t = t_1 the
binary data that correspond to the number b = -45 are applied onto the
laser diode in a bit-serial mode. These data are applied such that the
LSB is first. The resulting pattern is time-integrated by the detector
array. At time t = t_2 the data in the AO cell have moved by distance d_B,
which corresponds to a time-delay T_B equal to the duration of a bit plus
a zero (i.e., T_B = t_2 - t_1, see Figure 4.4). At the same time the second
bit (i.e., LSB + 1) of number b is applied onto the laser diode. A new
pattern is created which is added, by the detector array, onto the
already existing pattern. At time t = t_3, a new pattern leaves the AO
cell, is added by the detector, etc. Thus, after time t = t_11 - t_1 + T_B
the last (i.e., 11th) pattern has been created and added. At this
point, the values of the N elements of the detector are read. A close
inspection of Figure 4.4 shows that the readout charge has a value
(function of element) that corresponds to 33222022111, which
is the mixed binary form of the number +13 x -45 = -585. These analog
values are subsequently converted into twos complement representation
via the A/D/shift-register/accumulator unit.
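
The worked example can be verified with a short Python sketch. It assumes 11-bit twos complement words for both operands and keeps only the 11 lowest convolution points, which is what an 11-element detector array would record; the helper names are hypothetical.

```python
def twos_complement_bits(value, n):
    """n-bit twos complement representation of value, LSB first."""
    return [((value % (1 << n)) >> k) & 1 for k in range(n)]

def mixed_binary(a, b, n):
    """Lowest n convolution points of the two n-bit words (LSB first)."""
    a_bits = twos_complement_bits(a, n)
    b_bits = twos_complement_bits(b, n)
    return [sum(a_bits[i] * b_bits[k - i] for i in range(k + 1))
            for k in range(n)]

def decode(digits, n):
    """Weight each point by 2**k, wrap to n bits, read as twos complement."""
    raw = sum(d << k for k, d in enumerate(digits)) % (1 << n)
    return raw - (1 << n) if raw >= (1 << (n - 1)) else raw

digits = mixed_binary(13, -45, 11)
pattern = "".join(str(d) for d in reversed(digits))   # MSB side first
print(pattern, decode(digits, 11))
```

The printed pattern is 33222022111 and the decoded value is -585, in agreement with the readout described for Figure 4.4.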

We emphasize two important points: (1) Each bit in either the a or b data
is followed by a zero. This is necessary in order to separate (onto the
detector array) different output bits, and is due to the fact that while
the data b are applied, data a are moving. (2) The LSB of the word a is
followed by 2N zeros of total duration T = NT_B. This is necessary in
order to distinguish between different products; that is, if the LSB of
a is followed by the sign bit of a number c, the output would be
incorrect due to contributions from the product bc.

The unit we have described serves as a simple two-number multiplier.
Extension to vector-vector multiplication (via inner product formation)
can be achieved via the multi-unit architecture of Figure 4.5. In this
case an array of M laser diodes in conjunction with an M-channel AO cell
is used. Each of the a_i (i = 1, 2, ..., M) elements of vector a drives
a different AO cell channel. Similarly, each of the b_i (i = 1, 2, ...,
M) elements of vector b drives a different diode.

For the time being let us ignore the cylindrical lens. Instead, let us
assume that the AO cell is imaged onto an MxN element, 2-D detector
array. This system is basically a multi-channel version of the system
of Figure 4.3, and it provides M products a_i x b_i, i = 1, 2, ..., M.
Summation of these products results in an inner product a_1b_1 + ... +
a_Mb_M. This summation is accomplished via the integration property
(spatial integration) of the cylindrical lens and is valid because of
the mixed binary form of the output, which allows for product summation
without the need for carries. The resulting pattern is subsequently
time-integrated by the N-element, 1-D detector array placed at the back
focal plane and at location f = 0.
By utilizing the delay properties of a larger aperture M-channel AO
cell, in conjunction with an MxK array of laser diodes, we can extend
the system of Figure 4.5 to the systolic processor of Figure 4.6. This
system at peak operation can provide, in parallel, K inner products
b_i1 a_i1 + ... + b_iM a_iM, i = 1, 2, ..., K, which are formed via the
space-integration of the lens and the time-integration of the detector
array. Note that in this case the 1-D detector array needs to have KxN
elements. Also note that if the laser diodes illuminate adjacent cell
areas (i.e., along x), only K/2 inner products are formed during each
data cycle. This is because of the requirement that the LSB of the b_i
data be followed by 2N zeros of total duration T = NT_B. We elaborate on
these, as well as other, system issues in the following Section.


4.4  DMAC  Acousto-Optic  Time-Integrating  Processor  Characteristics 

Because of the algorithm used, the system described in Section 4.3 is
capable of forming high-accuracy inner products between bipolar-value
vectors. The output accuracy is N bits, which corresponds to a dynamic
range of 20 log (2^N). To fully utilize this accuracy, however, one has
to minimize possible detection errors, which can be large if we consider
the post-processing stage where we effectively weight each mixed binary
bit by a power of 2. The key point is to minimize the maximum value
accumulated in the detector array, which is NxM. For minimum detection
errors (bit error probability of < 2.9 x 10^-7), the detector array
should have a dynamic range of at least 10xMxN:1. This requirement
effectively constrains both N and M. Readily available state-of-the-art
detectors have a dynamic range of better than 35 dB which, in principle,
allows systems with N = M = 16. The throughput rate of such a system
highly depends on the data loading rate. Currently available AO cells
and laser diodes allow for bit widths down to 2 nsec. This means that
for N = 16, we need at least 2x16x2 nsec = 64 ns for product formation,
after loading the data in the AO cell. Consequently a system with N = K
= M = 16 has a throughput rate of 2 x 10^9 M-A/sec. The A/D requirements
depend on the number of bits we are reading out, which in turn depends on
the location of the detector element we are reading out. For example,
the maximum possible value for the sign-bit element is MxN, for the MSB
is Mx(N-1), for the MSB-1 is Mx(N-2), etc. To read these values, the
A/D's need to have log2(MxN), log2(Mx(N-1)), ..., etc., bits,
respectively. A simple analysis shows that, in order to read out 8
inner products, we need 64 A/D's at 8 bits, 32 A/D's at 7 bits, 16 A/D's
at 6 bits, ..., etc. These A/D's need to operate at a minimum input
frequency of ~16 MHz.
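
The A/D tally can be reproduced with a short Python sketch. It rests on two assumptions that should be read as such: the detector element with weight index n (n = 1, ..., N) accumulates at most M x n, and the number of bits needed is log2 of that peak value rounded up to an integer. The function name `adc_bit_histogram` is hypothetical.

```python
from collections import Counter

def adc_bit_histogram(m_channels, n_bits, inner_products):
    """Count how many A/D converters are needed at each resolution."""
    histogram = Counter()
    for weight in range(1, n_bits + 1):
        peak = m_channels * weight            # largest accumulated value
        bits = (peak - 1).bit_length()        # ceil(log2(peak)) for peak >= 1
        histogram[bits] += inner_products     # one A/D per element per product
    return dict(histogram)

hist = adc_bit_histogram(16, 16, 8)           # N = M = 16, 8 inner products
for bits in sorted(hist, reverse=True):
    print(f"{hist[bits]:3d} A/D's at {bits} bits")
```

Under these assumptions, N = M = 16 with 8 inner products yields 64, 32 and 16 converters at 8, 7 and 6 bits, followed by 8 each at 5 and 4 bits.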

Other system issues we need to consider are: (1) laser diode
collimation and (2) output detectors. The former depends on both N and
the sound speed in the AO crystal. For an efficient system a GaP AO
cell should be used because of its good diffraction efficiency
(> 30%/RF watt) and wide bandwidth (> 500 MHz 3-dB bandwidth). In this
material the sound speed is v_s = 6.3 x 10^6 mm/sec, which implies that,
for N = 16 and bit + zero width of T_B = 4 nsec, the total time-duration
of a binary word is T = 16 x 4 nsec = 64 nsec. This corresponds to a
distance of ~0.4 mm over which a single laser diode needs to be
collimated. An additional constraint is the maximum allowable light
cross-talk between adjacent diodes. To avoid detection errors the
cross-talk should be down by the same order as the dynamic range of the
detector; e.g., for N = K = 16, M = 16, it should be at least -34 dB.
These requirements can be efficiently met via the use of a fiber-optic
fan-out (Reference 8) in conjunction with an array of miniature
graded-index collimation lenses. Note that at any instant, every other
laser diode is off. In principle, one can take advantage of this by
turning off the corresponding detector arrays. This would guarantee
absence of cross-talk-related detection errors. To achieve this, one
needs detectors that can operate at the binary word switching frequency.
For N = 16 and T_B = 4 nsec the switching frequency is
f = 1/(16 x 4 nsec) ~ 16 MHz, which implies that the detectors should
have an integration time of 64 nsec or a clock frequency of 16 MHz.
Parallel-readout detector arrays with the above characteristics can, in
principle, be made with current technology. An alternative solution
involves a fiber-optic fan-out in conjunction with separate detector
elements. This solution guarantees not only the proper integration time
but also minimal detector cross-talk which, in turn, minimizes unwanted
detection errors.
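
The collimation and switching numbers above can be checked with the following Python sketch. The variable names are hypothetical, and the cross-talk level is computed as 10 log10 of the 10 x M x N detector ratio, which is an assumption about the decibel convention that matches the -34 dB figure.

```python
import math

V_SOUND_MM_PER_S = 6.3e6      # GaP sound speed quoted in the text
BIT_PLUS_ZERO_S = 4e-9        # T_B, one bit followed by one zero
N_BITS = 16
M_CHANNELS = 16

word_duration_s = N_BITS * BIT_PLUS_ZERO_S                # one binary word
collimation_mm = V_SOUND_MM_PER_S * word_duration_s       # acoustic length of a word
switch_freq_hz = 1.0 / word_duration_s                    # word switching rate
crosstalk_db = -10.0 * math.log10(10 * M_CHANNELS * N_BITS)

print(f"word duration      : {word_duration_s * 1e9:.0f} nsec")
print(f"collimation length : {collimation_mm:.2f} mm")
print(f"switching frequency: {switch_freq_hz / 1e6:.1f} MHz")
print(f"cross-talk level   : {crosstalk_db:.1f} dB")
```

The exact switching frequency is 15.6 MHz, which the text rounds to 16 MHz.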

4.5 BPAM Acousto-Optic Processor

In this section we describe one possible architecture which we have
developed for implementing BPAM. Consider the AO processor of
Figure 4.7. Light from 4 different laser diodes at wavelengths
λ_0, λ_1, λ_2, λ_3 is multiplexed in a fiber using conventional fiber-optic
techniques. The light level from each diode has a value proportional to
A_i, i = 0,1,2,3. Thus light at λ_0 has a value proportional to A_0, at λ_1
proportional to A_1, etc. Thus, if a 4 bit binary representation is
used, e.g., 1011, then light will be "on" from lasers at λ_0, λ_1 and λ_3,
and "off" from the laser at λ_2. The fiber output is then collimated
(via a lens) and expanded (via a pair of lenses) along the y-dimension.
Along the x-dimension the light is focused. The so-created "pencil"
beam illuminates a 4-channel Acousto-Optic cell. Each of the 4 channels
of the AO cell has a value proportional to B_j. Thus the bottom channel
gets the B_0 value, the next channel the B_1, etc. If B_0, B_1, B_2, B_3 is
binary, e.g., 0110, then we will have the two middle channels "on" and
the bottom and top channels "off". The AO device is followed by a prism
and a cylindrical lens. The prism/lens set accomplishes the following:
(1) wavelength demultiplexing and (2) focusing of the 16 spots. Thus, at
the back focal plane of the cylindrical lens we obtain 16 spots in a 4x4
format. Assuming
that  then  the  values  in  the  set  of  spots  are: 
Close inspection of the values of these spots reveals that we have
formed in parallel all A_iB_j, i,j = 0,1,2,3, products necessary for the
formation of the 7 bits (see Eq. (3.7)). Next we need to add these
products properly in order to obtain the 7 convolution points P_0-P_6.

The  additions  can  be  accomplished  in  a  variety  of  ways: 

(a)  The  use  of  detectors  with  specific  area  shapes  (e.g.,  see 
Figure  4.8).  The  shapes  are  such  that  the  proper  products  are 
added  instantly.  This  is  shown  in  detail  in  Figure  4.8. 

Comparison  of  the  results  of  Figure  4.8  and  those  of  the  previous 
Section  shows  that  we  indeed  obtain  the  correct  convolution  points. 

(b)  The  use  of  16  fibers  each  of  which  collects  light  from  a  particular 
spot.  The  proper  fibers  are  combined  onto  a  single  element 
detector  which  adds  the  light  leaving  the  fibers’  output.  In  this 
case  we  need  16  fibers  and  7  detectors. 

(c) The use of a cylindrical lens which is set at 45° with respect to
the y-dimension. This orientation of the lens results in a
collapse of the data, which are then read out by 7 detectors.


Note that all three techniques are simple and equivalent. The choice
depends on the particular design, processor size, etc. Finally, note
that the 4-channel AO device can be replaced by a single-channel device
in conjunction with frequency multiplexing. In this case, data
B_0, B_1, B_2, B_3 will be at frequencies f_0, f_1, f_2, f_3. Note, however,
that the AO device linearity requirements are increased in order to avoid
spurious effects due to the presence of 4 RF frequencies.

4.6 BPAM AO Processor Characteristics


The BPAM AO processor allows us to perform a digital multiplication
every clock cycle (not every 2N-1 cycles as in DMAC). This technique
fully utilizes the inherent speed of optics. In principle, the system
can multiply (and, if we desire, accumulate) binary numbers as fast as the
lasers or AO cells can be switched. State-of-the-art lasers can be
switched every 0.3 nsec (this translates to a throughput rate of
3 GOPS). For a practical scenario, however, the limiting factor will be
the speed with which the AO cell input data can be provided. With
currently available AO technology, a miniaturized unit of the processor
of Figure 4.7 could be operated at 250 MHz (see Section 4.7). In this
case the system's throughput rate is 0.25 x 10^9 M-A/sec.

The number of wavelengths is equal to the number of bits in the
input. To avoid an extensive number of λ_i (and AO channels) we use a
base system higher than 2. A good choice is base 4 with 4 digits. Then
we obtain 8 input binary bits and 16 output binary bits. In this case
we need 4 wavelengths, 4 AO channels, 7 detectors, and 7 A/D converters.
Note that in this case we also need 4 input light levels; that is, 0,1,2
and 3.
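
The base-4 variant works exactly like the binary case, with digit values 0 to 3 in place of bits. The Python sketch below is an illustration under the assumption of unsigned 8-bit operands split into 4 base-4 digits; the helper names are hypothetical.

```python
def to_digits(value, base, count):
    """Return `count` digits of value in the given base, least significant first."""
    digits = []
    for _ in range(count):
        digits.append(value % base)
        value //= base
    return digits

def bpam_base(a, b, base=4, count=4):
    """Form all digit products in parallel and sum along anti-diagonals."""
    a_digits = to_digits(a, base, count)   # laser light levels 0..3
    b_digits = to_digits(b, base, count)   # AO channel levels 0..3
    points = [0] * (2 * count - 1)         # one point per output detector
    for i, a_d in enumerate(a_digits):
        for j, b_d in enumerate(b_digits):
            points[i + j] += a_d * b_d
    value = sum(p * base ** k for k, p in enumerate(points))
    return points, value

points, value = bpam_base(255, 255)
print(points, value)
```

For the largest inputs every digit is 3, so the central detector sees 4 x 9 = 36 units; this sets the analog range each of the 7 detectors must resolve for a single product.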

To avoid extensive A/D requirements, we use time integration for
summation of different number products (i.e., creation of inner products
or vector-vector multiplication). The time integration can be
accomplished if each detector is followed by a charge integrator. For
clock speeds of 250 MHz, and for summation of 7 products, the A/D
requirements (Reference 9) are 7 A/D's with 8 bits each at 29 MHz.

4.7 BPAM AO Response Analysis

In this Section we address the speed of the BPAM AO processor and, in
particular, assess the ultimate speed limit of a BPAM-based AO
processor. There are three key components in the system of Figure 4.7:
(1) laser diodes, (2) AO cell and (3) output detectors and A/D's. The
speed limit of the first component is usually in the range of a few GHz.
For example, the LD53-0MF laser diode manufactured by ORTEL Corporation
has a modulation bandwidth of 6 GHz. This translates to ON/OFF pulses
of 0.167 nsec, which in turn translates to data rates of
1 bit/(2 x 0.167 nsec) = 3 Gb/s, assuming RZ data format. Thus, this
component allows throughput rates of the order of 3 x 10^9 M-A/sec. The
speed limit of the last component (detectors and A/D's) depends on
whether we use time-integration (i.e., detection of inner products
instead of single products). For most applications of interest, we
require vector-matrix multiplication or matrix-matrix multiplication.
This implies that the desired outputs are inner products, which in turn
implies that time-integration is preferable. For these applications, the
A/D speed requirement is about 1/(time for inner product formation).
Assuming that we are forming inner products that consist of 32 or more
products, and we allow 0.167 nsec per product and 0.167 nsec per zero
(each product is followed by a zero), we find that the time required for
the formation of an inner product is
32 x (0.167 nsec + 0.167 nsec) ≈ 10 nsec. Thus the A/D speed requirement
does not exceed 100 MHz, which is within the capabilities of the present
A/D technology.
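
The timing argument for the laser diodes and A/D's can be restated as a short Python sketch. It assumes that the ON/OFF pulse width is the reciprocal of the modulation bandwidth and that, in RZ format, each bit or product occupies one pulse followed by one zero of equal width; the function names are hypothetical.

```python
def rz_bit_rate(bandwidth_hz):
    """Bit rate for RZ data: one pulse plus one zero per bit."""
    pulse_s = 1.0 / bandwidth_hz
    return 1.0 / (2.0 * pulse_s)

def inner_product_time(terms, pulse_s):
    """Time to time-integrate `terms` products, each followed by a zero."""
    return terms * 2.0 * pulse_s

pulse_s = 1.0 / 6e9                       # about 0.167 nsec for 6 GHz
bit_rate = rz_bit_rate(6e9)               # about 3 Gb/s
t_inner = inner_product_time(32, pulse_s) # about 10.7 nsec
adc_rate = 1.0 / t_inner                  # about 94 MHz

print(f"{bit_rate / 1e9:.1f} Gb/s, {t_inner * 1e9:.1f} nsec, {adc_rate / 1e6:.0f} MHz")
```

Longer inner products lower the required A/D rate further, since the converter is read only once per inner product.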


Let us now examine the second component, namely, the AO cell. It can be
shown that the modulation bandwidth Δf, associated with a response
falloff (in dB) of β, is

$$ \Delta f = \frac{0.7\,\beta^{1/2}\,U}{n_A W_o} , \qquad (4.5) $$

where U is the speed of sound in the AO crystal, n_A is the refractive
index of the AO material, and W_o is the diameter of the input laser beam
at the e^-2 points. Let us assume that β = 3 dB, U = 6.32 x 10^3 m/sec,
and n_A = 3.31, which corresponds to a scenario with a GaP AO cell (which
is one of the "fastest" AO cell materials). In this case Equation 4.5
becomes:

$$ \Delta f = \frac{2.31 \times 10^{3}\ \text{m/sec}}{W_o} \qquad (4.6) $$
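
As a numerical check, the Python sketch below evaluates the bandwidth expression. It assumes that Equation 4.5 has the form Δf = 0.7 β^(1/2) U / (n_A W_o), which is the reading that reproduces the 2.31 constant of Equation 4.6 from the stated GaP parameters; the function name `ao_bandwidth_hz` is hypothetical.

```python
import math

def ao_bandwidth_hz(beam_diameter_m, beta_db=3.0, sound_speed=6.32e3, index=3.31):
    """Modulation bandwidth of the AO cell for a given e^-2 beam diameter.

    Assumed form: 0.7 * sqrt(beta) * U / (n_A * W_o), with U in m/sec
    and W_o in metres.
    """
    return 0.7 * math.sqrt(beta_db) * sound_speed / (index * beam_diameter_m)

constant = ao_bandwidth_hz(1.0)        # numerator of Eq. 4.6 in m/sec
example = ao_bandwidth_hz(4.92e-6)     # a 4.92 micrometre focused beam
print(f"{constant:.0f} m/sec, {example / 1e6:.0f} MHz")
```

The bandwidth scales inversely with the beam diameter, so halving the focused spot size doubles the attainable modulation rate.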


To calculate Δf and, thus, the AO cell's "speed of response," we need to
know the laser beam diameter W_o. The diameter W_o of the focused laser
beam is given by:

$$ W_o = S f_L \qquad (4.7) $$


where S is the angular spot radius and f_L is the focal length of the
lens used to focus the laser beam. The angular spot radius S is given
by

$$ S = \tfrac{1}{2} K(N_o) (D/f_L)^3 + \frac{1.22\,\lambda}{n D} \qquad (4.8) $$


where $D$ is the beam diameter incident on the lens, $n$ is the index of
refraction of air, and $K(N_o)$ is an explicit function of the index
ratio $N_o = n'/n$ ($n'$ is the index of refraction of the lens material)
and of the lens shape. In Equation 4.8, the first term represents the
contribution of spherical aberration to the beam size $W_o$. The second
term represents the contribution of diffraction effects to the beam
size $W_o$. Clearly, we wish to choose the value of $D$ that gives the
minimum value of $S$. This optimization has been carried out numerically
as a function of $f_L$ for the case of a plano-spherical lens, which has
the smallest $K(N_o)$, given by

$$K(N_o) = \frac{N_o^{2} - 2N_o + 2/N_o}{32\,(N_o - 1)^{2}} . \qquad (4.9)$$


As fixed data we use $\lambda = 780$ nm (corresponding to a high-speed
AlGaAs laser diode) and $N_o = 1.51108$ (corresponding to BK-7 lens
material). The results of the calculations are given in Table 4.1,
which shows the spot size $W_o$ as a function of the focal length $f_L$,
together with the optimum $D$, $D_{opt}$, that minimizes Eq. 4.8.


Table 4.1  W_o as a Function of f_L

    f_L (mm)     D_opt (mm)     W_o (μm)

      1.00          0.31          4.14
      2.00          0.52          4.92
      3.00          0.70          5.44
      4.00          0.87          5.85
      5.00          1.03          6.18
      6.00          1.19          6.47
      7.00          1.33          6.73
      8.00          1.47          6.95
      9.00          1.61          7.18
     10.00          1.74          7.35

The smallest possible focal length that we can hope to use with a
state-of-the-art GaP AO cell is 2 mm, dictated by the crystal width
necessary to provide uniformity in the lapping process of the AO cell's
transducer. From Table 4.1 we see that for $f_L = 2$ mm the diameter
$W_o$ of the focused spot is 4.92 μm. Substituting $W_o = 4.92$ μm in
Eq. 4.6, we find that the modulation bandwidth $\Delta f$ of the AO cell
cannot exceed 470 MHz. Thus, each bit in the AO cell will be
represented by a pulse that has a width of ≈ 2 nsec. This corresponds
to a data rate (assuming RZ data format) of

Data rate = 1/(2 + 2 nsec) ≈ 250 Mb/s                         (4.10)

In conclusion, we find that the component that sets the limit in the
BPAM AO system is the AO cell. We also find that, in the best case,
the data rate in the AO cell is ≈ 250 Mb/s, which corresponds to a
throughput rate for the BPAM AO system of $250 \times 10^{6}$ M-A/sec.
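
The chain of estimates above can be checked numerically. The following is a minimal sketch, assuming that Eq. 4.5 has the form 0.7 sqrt(beta) U / (n W_o), which reproduces the 2.31 x 10^3 m/sec coefficient of Eq. 4.6. The function names are illustrative and do not come from the report.

```python
import math


def ao_modulation_bandwidth(beta_db, sound_speed, refractive_index, spot_diameter):
    """Bandwidth in Hz, assuming Eq. 4.5 reads 0.7*sqrt(beta)*U/(n*W_o)."""
    return 0.7 * math.sqrt(beta_db) * sound_speed / (refractive_index * spot_diameter)


def rz_data_rate(pulse_width):
    """Bits per second when every pulse is followed by an equal-width zero (RZ)."""
    return 1.0 / (2.0 * pulse_width)


# GaP values quoted in the text; W_o is the Table 4.1 entry for f_L = 2 mm.
bandwidth = ao_modulation_bandwidth(3.0, 6.32e3, 3.31, 4.92e-6)
# The text rounds the pulse width (about 1/bandwidth) to 2 nsec.
data_rate = rz_data_rate(2e-9)

print(f"bandwidth ~ {bandwidth / 1e6:.0f} MHz")
print(f"data rate ~ {data_rate / 1e6:.0f} Mb/s")
```

With the quoted GaP parameters the bandwidth comes out in the 470 MHz range, and the RZ data rate for a 2 nsec pulse is 250 Mb/s.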




5.  CIRCULARLY  POLARIZING  SAMPLING  TECHNIQUE  FOR  COMPLEX  MATRIX  OPERATION 


In the previous Section we described Acousto-Optic architectures
which can be used for performing matrix multiplication operations.
Matrix multiplication is the central operation for our applications of
interest. So far, the elements in those matrices have been treated as
real numbers. In reality, however, the elements are generally complex
numbers. This is unavoidable, especially when the original signals are
acquired from heterodyning processes that yield a quadrature pair.
Therefore it is very important to provide an efficient method of
handling complex numbers in order to achieve efficient matrix
multiplication. There are two conventional techniques to accommodate
complex numbers; both have serious problems.

The first method is a well-known software solution that realizes the
complex multiplication by first decomposing the operation into four
independent multiplications of real matrices (Figure 5.1) to form
real x real, real x imaginary, imaginary x real, and imaginary x
imaginary terms, and later synthesizes the real part and the imaginary
part of the final product by summing the square terms and the cross
terms, respectively. This method allows us to utilize a matrix
multiplier for real numbers without any modification in the hardware.
The disadvantage of this method is the slow speed which results from
the necessity of reading the buffer memory (which contains the real and
imaginary parts) four times. Thus, it takes four times longer than the
multiplication of real matrices.
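
The four-pass decomposition can be sketched in a few lines. This is an illustrative software model only (the helper names are assumptions, not part of the report); it forms the four real products and then combines the square terms and the cross terms.

```python
def real_matmul(a, b):
    """Multiply two real matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def complex_matmul_via_real(a_re, a_im, b_re, b_im):
    """Form (A_re + j A_im)(B_re + j B_im) from four real matrix products."""
    rr = real_matmul(a_re, b_re)  # real x real
    ri = real_matmul(a_re, b_im)  # real x imaginary
    ir = real_matmul(a_im, b_re)  # imaginary x real
    ii = real_matmul(a_im, b_im)  # imaginary x imaginary
    rows, cols = len(rr), len(rr[0])
    # Square terms give the real part (j*j = -1); cross terms give the imaginary part.
    c_re = [[rr[i][j] - ii[i][j] for j in range(cols)] for i in range(rows)]
    c_im = [[ri[i][j] + ir[i][j] for j in range(cols)] for i in range(rows)]
    return c_re, c_im


a_re, a_im = [[1, 3], [0, 2]], [[2, -1], [1, 0]]
b_re, b_im = [[2, 0], [1, 1]], [[0, 1], [-1, 3]]
c_re, c_im = complex_matmul_via_real(a_re, a_im, b_re, b_im)
print(c_re, c_im)
```

The four calls to `real_matmul` correspond to the four passes through the buffer memory, which is where the factor-of-four slowdown arises.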

The second method is a hardware solution. It requires a major
modification of the multiplier cells to accommodate complex numbers
(Figure 5.2). Each cell must contain four multipliers and two adders
and must compute the four terms in parallel. The problem with this
method is the tremendous complexity of its hardware. It requires that
the size of each cell be increased by a factor of four and that the
number of input ports be doubled. It is very difficult to realize
such hardware, especially in the AO architectures.

Figure 5.1  Multiplication of Complex-Valued Matrices Using Real-Valued
Matrix Multiplier. Problem: slow speed. The correlator is required to
read the memory four times, and then needs two more adding steps to
obtain the complex pair.

In the following paragraphs we present a solution that maintains
relatively high speed and hardware simplicity at the same time.

A complex signal carries more information than a real signal.
Therefore, the manipulation of such signals is more computation
intensive. However, we can improve the data handling significantly by
arranging the input data in the most efficient way. One major
complexity in handling complex matrix multiplication is the duality:
every sampled data point comes in the form of a pair, composed of the
real part time series and the imaginary part time series. Our approach
solves the complexity problem by forming a composite, single time
series that is capable of representing both the real and imaginary
parts. Thus, each element of the matrix is a single number instead of
a real and imaginary pair. This arrangement simplifies the matrix
multiplication operation very significantly. The approach consists of
two steps: (1) a special sampling process and (2) the use of such
samples in the matrix multiplication.

5.1  Circularly  Polarizing  (CP)  Sampling 

A typical data acquisition process is depicted in Figure 5.3. First,
the original signal is received in the antenna and mixed with the local
oscillator of the target frequency by using an ordinary heterodyne
process that yields an analog quadrature pair representing the real part
and the imaginary part. In the conventional technique those signals are
digitized by A/D converters which are strobed by the same clock to
generate a digital quadrature pair in every sampling period, $t_s$. In
the present CP sampling scheme, the two A/D converters are strobed by
clocks having the same frequency but phase shifted by 180°. Also, the
sign of the sample in each quadrature is alternated at every sample.
The digitized results are interlaced to form a single string of sampled
data with the interval $(1/2)\,t_s$. This composite time series has
either a progressive quadrature (1, j, -1, -j, ...) or a regressive
quadrature (1, -j, -1, j, ...), depending on the phase of the polarity
alternation. We call the former right-hand circularly polarizing (RCP)
sampling and the latter left-hand circularly polarizing (LCP) sampling,
indicating the rotational orientation of the phasor.

Figure 5.3  Circularly Polarizing Sampling Scheme

Appendix A shows that either of the CP sampling schemes is capable of
representing both real and imaginary values of the original signal
without loss of information and free from the aliasing problem as long
as the sampling frequency $(1/t_s)$ is greater than the bandwidth of the
original signal. It also shows that RCP and LCP sampled signals are
conjugate to each other. The bandwidth requirement is identical to that
of the conventional sampling case; namely, the Nyquist criterion. The
obvious advantage of this data representation is simplicity. We now
show the use of the CP sampled data in performing matrix multiplication.
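
As an illustration of the interlacing, the sketch below models the two A/D converters in software. It is a hypothetical model, not the report's hardware: `signal` is assumed to be a function returning the complex baseband value at time t, the real channel is taken at even half-period slots and the imaginary channel at odd slots, and the sign pattern follows the RCP and LCP sequences given above.

```python
def cp_sample(signal, t_s, n_samples, hand="RCP"):
    """Return n_samples real CP samples spaced t_s/2 apart.

    Even slots read the real channel, odd slots the imaginary channel
    (clocks 180 degrees apart), and the sign of each channel alternates.
    """
    # Sign applied in slots 0..3 of each four-slot cycle.
    signs = (1, 1, -1, -1) if hand == "RCP" else (1, -1, -1, 1)
    samples = []
    for m in range(n_samples):
        z = signal(m * t_s / 2.0)
        part = z.real if m % 2 == 0 else z.imag
        samples.append(signs[m % 4] * part)
    return samples


def constant_signal(t):
    """Test signal: a constant complex value, independent of time."""
    return complex(3.0, 4.0)


rcp = cp_sample(constant_signal, 1.0, 4, "RCP")
lcp = cp_sample(constant_signal, 1.0, 4, "LCP")
print(rcp)  # [3.0, 4.0, -3.0, -4.0]
print(lcp)  # [3.0, -4.0, -3.0, 4.0]
```

Each output is a single real number per slot; the quadrature it belongs to is implied by its position in the sequence.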

5.2  Matrix  Multiplication  Using  CP  Sampled  Data 

We first discuss a correlation operation as an example of the simplest
case of matrix-vector multiplication. Later we address the more general
case.

Consider a correlation function $y(\tau)$ for signals $h(t)$ and $x(t)$,
where

$$y(\tau) = \int h^{*}(t-\tau)\, x(t)\, dt \qquad (5.1)$$

and $h(t)$, $x(t)$, and $y(\tau)$ are continuous complex functions and
$h^{*}(t)$ is the conjugate of $h(t)$. We can translate the situation to
the discrete CP sample domain by

$$y_{\tau} = \sum_{t} h^{*}_{t-\tau}\, x_{t} \qquad (5.2)$$

where $y$, $h$, and $x$ are CP sample functions and $t$ and $\tau$ are
integers.

Suppose the correlating function $h_t$ has finite duration, say, the
duration of four data points. Then we can express the correlation
operation in the form of a multiplication between the band matrix H
with bandwidth 4 and the vector X (see Figure 5.4). Now the input
function $x_t$ is a CP sampled function carrying the sequential
quadrature information (1, j, ...), and the correlating function
$h^{*}_t$ is the conjugate of $h_t$. Therefore, it has the corresponding
LCP quadrature sequence (1, -j, -1, j, ...). The first element of the
output vector, $y_1$, is the result of the inner product between the
first row of the matrix H and the vector X, or

1st element: $(h_1)(x_1) + (-jh_2)(jx_2) + (-h_3)(-x_3) + (jh_4)(-jx_4)$ .   (5.3)

Note that the phase of all the product terms turns out to be 0°.
Therefore, the first element can be expressed as:

$$y_1 = h_1x_1 + h_2x_2 + h_3x_3 + h_4x_4 \qquad (5.4)$$

It is clear that this inner product calculation requires only the
multiply/add capability of real numbers. The second element in the
output vector is calculated similarly.

2nd element: $(h_1)(jx_2) + (-jh_2)(-x_3) + (-h_3)(-jx_4) + (jh_4)(x_5)$   (5.5)

In this case, the phase of every term is a constant 90°. Thus, it
is appropriate to represent the second element as:

$$jy_2 = j\,(h_1x_2 + h_2x_3 + h_3x_4 + h_4x_5) \qquad (5.6)$$

so that the value $y_2$ still remains real. In a similar way, we can
obtain the third and fourth elements:

$$-y_3 = -(h_1x_3 + h_2x_4 + h_3x_5 + h_4x_6) \qquad (5.7)$$

$$-jy_4 = -j\,(h_1x_4 + h_2x_5 + h_3x_6 + h_4x_7) \qquad (5.8)$$

It is clear that the rest of the elements can be obtained in the same
way. None of these inner product calculations requires complex-valued
multiplication capability, owing to the CP sampling. The output vector
is again in the form of CP samples.

Figure 5.4  Correlation Operation Using CP Sampling Approach. The phase
of the product between a row of the H matrix and the X column vector is
always constant; thus real-valued multiply-add capability is sufficient
to calculate the output vector Y.

Thus, we see that once we represent the input complex signal in CP
sample form, the output is also in CP sample form, and we can carry out
the entire operation in the CP sample domain. The method treats all the
numbers as real numbers, and the quadrature information is coded in the
data position itself.
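
The phase bookkeeping of Eqs. 5.3 to 5.8 can be verified with a short numerical sketch. The sample values below are arbitrary illustrations, the indices are zero-based rather than one-based, and the helper names are assumptions. The real-only correlation is compared with a full complex correlation in which every sample carries its positional phasor explicitly.

```python
PHASE = (1, 1j, -1, -1j)  # RCP phasor implied by sample position


def cp_correlate(g, c):
    """Real-only correlation of CP samples: y[tau] = sum_k g[k] * c[k + tau]."""
    n_out = len(c) - len(g) + 1
    return [sum(g[k] * c[k + tau] for k in range(len(g))) for tau in range(n_out)]


def complex_correlate(h, x):
    """Reference correlation with explicit complex arithmetic and conjugation."""
    n_out = len(x) - len(h) + 1
    return [sum(h[k].conjugate() * x[k + tau] for k in range(len(h)))
            for tau in range(n_out)]


g = [1.0, 2.0, -1.0, 0.5]                   # CP samples of h (four points)
c = [0.5, -1.0, 2.0, 3.0, 1.0, -2.0, 4.0]   # CP samples of x

# Attach the positional phasors to obtain the equivalent complex sequences.
h = [complex(g[k]) * PHASE[k % 4] for k in range(len(g))]
x = [complex(c[m]) * PHASE[m % 4] for m in range(len(c))]

y_real = cp_correlate(g, c)
y_complex = complex_correlate(h, x)

for tau, (yr, yc) in enumerate(zip(y_real, y_complex)):
    # Each complex output equals the real CP output times its positional phasor.
    print(tau, yr, yc, abs(yc - yr * PHASE[tau % 4]) < 1e-12)
```

The real-valued outputs, read with phasors 1, j, -1, -j by position, match the complex result, which is the sense in which the output vector is itself in CP sample form.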

In the case of the more general shift-variant system, the matrix is not
necessarily banded and there is no repetition of the same series from
row to row as is the case in the correlation or convolution operation.
However, the matrix-vector multiplication will still be accomplished in
the same manner as long as the data position represents the proper
quadrature. An example of positional coding of phase is shown in
Figure 5.5. An interesting characteristic of the matrix is that the
quadrature progresses along the two-dimensional matrix and is constant
along the diagonal orientation. As long as this quadrature-based
pattern is used, we can perform multiplication of any matrices, using
only real numbers.
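
A sketch of the general (non-banded) case follows. Since Figure 5.5 is not reproduced here, the phase pattern is an assumption: the element in row r and column k is taken to carry the phasor j^(r-k), which is constant along each diagonal and reduces to the LCP row pattern of the correlation example. The matrix and vector values are arbitrary.

```python
PHASE = (1, 1j, -1, -1j)  # powers of j, indexed modulo 4


def real_matvec(a, c):
    """Matrix-vector product using real multiply-add only."""
    return [sum(a[r][k] * c[k] for k in range(len(c))) for r in range(len(a))]


a = [[1.0, -2.0, 0.5],
     [3.0, 1.0, -1.0],
     [0.0, 2.0, 4.0]]   # real CP-coded matrix entries
c = [2.0, -1.0, 3.0]    # real CP-coded input vector

y_cp = real_matvec(a, c)

# Equivalent complex problem: attach the assumed positional phasors.
n = len(c)
h = [[a[r][k] * PHASE[(r - k) % 4] for k in range(n)] for r in range(n)]
x = [c[k] * PHASE[k % 4] for k in range(n)]
y_complex = [sum(h[r][k] * x[k] for k in range(n)) for r in range(n)]

for r in range(n):
    # Row r of the complex result carries the phasor j^r times the real result.
    print(r, y_cp[r], y_complex[r])
```

Under this assumed pattern every product in row r has the same phase j^r, so the real multiply-add result is again a valid CP-coded vector.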

In summary, we believe that this new CP sampling approach solves, to a
great extent, the problems of matrix-vector multiplication for
complex numbers in a very systematic way. It is important to note that
the total operation is carried out in the CP sample domain, and that it
gives us the freedom to cascade such operations, or the freedom to
iterate the operation by feeding back the result of one multiplication
to the next without an intermediate domain conversion process.


6.  PERFORMANCE OF HIGH ACCURACY ACOUSTO-OPTIC PROCESSORS


In this Section we consider the performance of the high accuracy AO
processors which we have developed (Section 4). It is of interest to
compare these systems with typical, state-of-the-art electronic devices.
In describing system performance, one measure that is usually employed
is the throughput rate (TR). However, when comparing different systems,
it is also of interest to examine the efficiency of the system (SE),
defined as the throughput rate per unit power, along with the net
multiplication speed (MS). The efficiency measure is important because
it indicates the power consumption necessary for a given TR. The
multiplication speed measure is useful because in certain applications
high MS rather than high TR is important.

The comparison is done by calculating the SE and MS figures for
various families of state-of-the-art electronic multiplier/accumulators
as well as for typical space-integrating and/or time-integrating AO
processors that employ DMAC or BPAM. These calculations are used as the
basis for a simple, first-order comparison which gives a clear picture
of the computational competitiveness of AO processors.

6.1  Performance  of  Electronic  Multipliers 


Current state-of-the-art electronic competition comes from three
families of electronic integrated circuits: (a) CMOS, (b) high-speed,
Silicon-based VLSI, and (c) GaAs.

In the first category we have a variety of commercially available
multiplier/accumulator chips such as: (a) the Toshiba T6354 16 input
bit, 32 output bit (16/32 bit) chip and (b) the Logic Devices LMA 1009-1
16/32 bit device. The first device has an MS of 10 MHz and a power
consumption of 100 mW. This corresponds to an SE of
$100 \times 10^{6}$ M-A/sec·W. The second device has an MS of 15 MHz with
a power consumption of 125 mW. This corresponds to an SE of
$120 \times 10^{6}$ M-A/sec·W.

In the second category we have various high-speed VLSI devices which are
usually custom made for specific signal processing applications and can
contain a large number of multiplier/accumulator units. Westinghouse's
typical high-speed 8/16 bit VLSI is capable of an MS of 30 MHz while
consuming about 250 mW. This corresponds to an SE of
$120 \times 10^{6}$ M-A/sec·W.

Finally, in the third category, we have a number of GaAs LSI devices, an
example of which is the Rockwell 8/16 bit multiplier. This device
forms the 16 bit product in 5.25 nsec, which corresponds to an MS of
190 MHz. Power dissipation is about 1.4 W. Thus, the device's SE is
$135 \times 10^{6}$ M-A/sec·W.

The results of the above calculations are compiled in Table 6.1. From
this table we see that although the MS varies from family to family
(10, 30, 100 MHz, respectively), the SE remains about the same
(~ $120 \times 10^{6}$ M-A/sec·W).
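
Because a single multiplier/accumulator has TR equal to its MS, the SE figures above follow from one division. The short check below is a sketch using only the MS and power values quoted in this subsection; the dictionary layout and function name are illustrative.

```python
def system_efficiency(ms_hz, power_w):
    """SE in M-A/sec per watt for a single multiplier (TR equals MS)."""
    return ms_hz / power_w


# (multiplication speed in Hz, power in W), as quoted in the text.
devices = {
    "Toshiba T6354": (10e6, 0.100),
    "Logic Devices LMA 1009-1": (15e6, 0.125),
    "Westinghouse VLSI": (30e6, 0.250),
    "Rockwell GaAs": (1.0 / 5.25e-9, 1.4),  # MS from the 5.25 nsec product time
}

efficiencies = {name: system_efficiency(ms, p) for name, (ms, p) in devices.items()}
for name, se in efficiencies.items():
    print(f"{name}: SE ~ {se / 1e6:.0f} x 10^6 M-A/sec.W")
```

The four results cluster between roughly 100 and 140 x 10^6 M-A/sec·W, which is the basis for the observation that SE is nearly independent of device family.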

6.2  Performance  of  DMAC  Based  AO  Processors 

We begin the analysis of AO processors by considering DMAC based
systems. We first examine the performance of the space-integrating
single detector AO system of Section 4.1. We choose to consider this
system first because it represents a typical example of the ability of
optics not only to multiply but also to compress (i.e., add) data.

Let us assume that the input accuracy is 8 bits. In this case, and for
M = 32, the maximum value of the output convolution is 8 x 32 = 256,
which requires an 8-bit A/D. State-of-the-art (chip level) 8-bit A/D's
operate at 30 MHz and consume 120 mW of power. Use of 3 such A/D's in
conjunction with an electronic data deflection scheme allows a
conversion rate of 90 MHz at a power consumption of about 500 mW. Such
a rate allows data periods of about 11 nsec. The convolution operation
is completed in 2N-1 = 15 cycles, and thus the MS is 6 MHz and the TR is
$192 \times 10^{6}$ M-A/sec. For the purposes of this analysis we assume
that the bulk of the power consumption comes from the laser and the
A/D's. A detailed analysis shows that the laser consumption is about
5 W. Thus, the total power consumption is 5.5 W and the SE is
$35 \times 10^{6}$ M-A/sec·W. This is a rather poor figure and is
partially due to the low MS figure. To improve this figure let us
assume that 15 A/D's are used so that effective conversion rates of
450 MHz are achievable. In this case we can use clock frequencies of
900 MHz and have an MS of 30 MHz, a TR of $960 \times 10^{6}$ M-A/sec and
an SE of $128 \times 10^{6}$ M-A/sec·W. Comparison of these figures with
those of the electronic cases, however, shows that the AO system does
not offer any significant SE or MS advantage.

We now examine a different AO system architecture, specifically an
array, non-compressive processor. This system is similar to the one of
Figure 4.2, but it uses a cylindrical lens (with power along x) in
conjunction with M detectors for the individual computation of the M
products. In this case, for an 8/16 bit system, the maximum convolution
value is 8, which requires a 3-bit A/D. We assume that a set of 8-level
comparators will be used because of the relatively low resolution
required. For 8-level comparison, 3-dual level comparators are needed
along with some dedicated logic. A typical example of a high-speed
comparator is the Advanced Micro Devices AM8687, which allows us to
build an 8-level comparator circuit which operates at about 300 MHz and
has a power consumption of about 800 mW. This allows a clock frequency
of 600 MHz, which translates to an MS figure of 20 MHz. Thus for M = 32,
the system's TR is $640 \times 10^{6}$ M-A/sec. To calculate the SE
figure we need to calculate the laser power. If we use one laser diode
per AO channel we find that 5 mW, 20% efficient, laser diodes are
sufficient. In this case, the total laser diode power consumption is
25 x 32 mW = 800 mW and the total power consumption is 28 W. This gives
an SE figure of $23 \times 10^{6}$ M-A/sec·W. Comparison of the system's
performance figures with those of the electronic counterparts shows that
the AO system, once again, does not offer any significant advantage.


Table 6.1

We now examine the time-space integrating systolic AO processor of
Section 4.3.

For minimum detection errors, and with state-of-the-art components, an
8/16 bit system with N = K = 16 and M = 8 is realistic. The TR of such a
system depends on the data loading time. Currently available AO cells
allow for bit widths of 2 nsec. Thus, each multiplication requires
2 x 8 x (2 + 2) nsec = 64 nsec. This corresponds to an MS of 15.6 MHz
and a TR of $8 \times 8 \times 15.6 \times 10^{6}$ M-A/sec, or about
$10^{9}$ M-A/sec. To calculate the power consumption we take into
account the laser diodes and the A/D's only. An 8 x 16 laser diode
array (with 5 mW, 20% efficient diodes) requires an average power of
1.6 W. The power consumption of the A/D's depends on the number of
A/D's used and the operating frequency. For full use of the system's MS
we need to use a combination of serial and parallel detector read out.
With such a scheme we need 64 7-bit A/D's, 32 6-bit A/D's, 16 5-bit
A/D's, etc., that operate at 16 MHz. If we use the 30 MHz, 120 mW 8-bit
A/D's, we can replace the 64 7-bit A/D's with 32 such devices (i.e., two
elements per A/D). The total power consumption of these devices is
3.84 W. This figure represents ≈ 70% of the total A/D power
consumption. Thus, for our purposes, the total power consumption is
≈ 7.1 W and gives an SE of $141 \times 10^{6}$ M-A/sec·W. Comparing
these figures with those of the electronic competition (Table 6.1) we
find that the AO system does not offer any significant performance
advantage.

In conclusion, we see that the DMAC-based architectures do not offer any
significant performance advantages (Table 6.2) when compared with
their electronic counterparts (Table 6.1). Similar performance is
characteristic of other architectures that have appeared in the open
literature but for reasons of space have not been included in this
Section. There are two main factors which limit this performance.


Table 6.2  TYPICAL PERFORMANCE OF DMAC AO SYSTEMS

TYPE                               I/O       MS       POWER    SE
                                   (BITS)    (MHz)    (W)      (M-A/sec W)

SPACE INTEGRATING                  8/16      6        5.5      35 x 10^6
(1 Detector, 3 A/D's)

SPACE INTEGRATING                  8/16      30       6.8      128 x 10^6
(1 Detector, 15 A/D's)

SPACE INTEGRATING                  8/16      20       28       22.7 x 10^6
(32 Detectors, 32 Comparators)

TIME/SPACE INTEGRATING             8/16      15.6     7.1      141 x 10^6

First, the algorithm itself takes 2N-1 cycles to complete the
convolution. This results in the requirement of at least one bit-serial
propagation and thereby substantially reduces the TR (which in principle
can be high). Second, optics performs only part of the full,
high-accuracy multiplication, namely, the convolution, and subsequently
requires the "help" of power-consuming electronics (e.g., A/D's) to
complete the operation. This results in an increased power consumption
and a decreased SE.

Based on the above observations we conclude that in order to improve the
MS and SE figures, we need to decrease the time required for completion
of the convolution and eliminate the A/D's.

6.3  Performance  of  BPAM  Based  AO  Processors 

Since the BPAM operation requires one clock cycle, the system TR is
equal to the speed with which we can address the AO cell. In
Section 4.7 we showed that at best the AO device can be operated at
about 250 MHz, which corresponds to a TR of $250 \times 10^{6}$ M-A/sec
and an MS of 250 MHz. The power consumption of the system depends on:
(a) the number of bits in the input and (b) the number of A/D's used.
To avoid an extensive number of wavelengths we need to use a base system
higher than 2. If we use base 4 with 4 digits then we have a processor
of 8/16 bits. In this case we need 4 laser diodes, 7 detectors and
7 A/D's. If we use the 8-bit, 30 MHz, 120 mW A/D's, then we can read out
the output at a rate of 29 MHz. Each such output will correspond to an
inner product which is the sum of 8 number products (the summation is
performed in time, at the detectors). In this case the A/D power
consumption is 7 x 120 mW = 840 mW. Adding to this figure the power
consumption of the laser diodes (4 x 25 mW = 100 mW) we find that the
power consumption of the system is 940 mW. This corresponds to an SE of
$266 \times 10^{6}$ M-A/sec·W.
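
The power budget for the base-4 BPAM example can be reproduced with a few lines. This is only an arithmetic sketch of the figures quoted above (7 A/D's at 120 mW, 4 laser diodes at 25 mW, TR of 250 x 10^6 M-A/sec); the function name is illustrative.

```python
def bpam_budget(throughput, n_adc, adc_power_w, n_lasers, laser_power_w):
    """Return (total power in W, SE in M-A/sec per W), counting A/D's and lasers only."""
    power = n_adc * adc_power_w + n_lasers * laser_power_w
    return power, throughput / power


power, se = bpam_budget(250e6, 7, 0.120, 4, 0.025)
print(f"total power ~ {power * 1e3:.0f} mW")
print(f"SE ~ {se / 1e6:.0f} x 10^6 M-A/sec.W")
```

As in the text, only the A/D's and the laser diodes are counted, so the result is a first-order estimate.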

Comparing the MS and SE figures of the BPAM-based AO system with those
of the DMAC based AO systems (Table 6.2), we find that the MS figure of
the BPAM system is higher by about an order of magnitude. This
improvement is expected since the time required for the BPAM is reduced
by a factor of 2N-1. On the other hand, because MS is increased, one
might expect SE to increase proportionally. This is not the case,
however, because SE is increased only by a factor of 2. This is due to
the fact that BPAM requires 2N-1 output detectors and A/D's for the
detection/conversion of a single product. This translates to additional
power consumption which partially offsets the MS improvement.

If we now compare the performance of the BPAM-based AO system with that
of the electronic competition (Table 6.1), we find that the BPAM system
has an MS figure that is higher by about an order of magnitude than that
of Silicon-based devices. However, this advantage essentially
disappears when compared with GaAs devices. Thus, once again, we find
that for all practical purposes the AO systems do not offer any
significant performance advantage (e.g., an order of magnitude
improvement in SE and/or MS). One of the main reasons for this behavior
is the fact that the available algorithms (both DMAC and BPAM) require
power-hungry post-detection electronics (i.e., A/D's and comparators)
for conversion of a multi-level analog signal to a binary output.

A method of overcoming this problem is to eliminate the presence of
analog signals by employing binary valued $A_i$, $B_i$, i = 0, ..., N-1
input levels. In this case the various products $A_i B_j$ can take only
two values, 0 and 1. This implies that, in the absence of an optically
implemented product summation, the N detectors (one for each product)
simply detect the presence or absence of light. Subsequently, their
binary outputs are used to directly drive pulse-counting electronics
(e.g., counters) and not A/D's or comparators. Once these devices count
M such pulses (for an M-element inner product) they drive properly
arranged digital adders that perform the $A_i B_j$ product summations.
The outputs from the adders subsequently drive a
shift-register/accumulator that performs the final weighting and
summation operations. An example of such an electronic arrangement is
shown in Figure 6.1 for the case of a 2-bit multiplier.

In optimizing such an architecture, it becomes immediately apparent
that, aside from its delay properties, the AO cell is used as a simple
optical switch. In fact, to provide this switch function, the AO cell
is unnecessarily complex because: (a) it requires RF carriers, mixers
and amplifiers, (b) it requires a rather complicated optical system, and
(c) it requires an 8-channel optical multiplexer (for an 8/16 bit
system). A far simpler and faster system can be realized by the use of
two sets of N laser diodes in conjunction with 2N detectors and N AND
gates. An example of a possible arrangement of such a system is shown
in Figure 6.2 for a 2-bit multiplier. Other possible arrangements
involve the use of fiber-pigtailed lasers in conjunction with
fiber-optic 1:N splitters, or overlapping laser beams with detectors
(located at the cross points) in conjunction with threshold detection,
etc. Note that the speed of the optical part of any possible
implementation of the system can exceed 3 GHz even assuming the use of
state-of-the-art components. Thus, the throughput limiting factor is
the AND gates together with the counters that follow them. Finally, a
simple analysis shows that a scheme of parallel input counters, which
are used in order to obtain the sum of 1's for each convolution point,
is preferable to the scheme shown in Figure 6.1. This is the well known
Dadda scheme for implementing a fast many-bit (> 16/32) digital
multiplier.

Figure 6.1  Counter Arrangement for a 2-bit Multiplier

Figure 6.2  Laser-Diode/AND-Gate System for Product Formation


Thus, we have formed a simple multiplier system that does not require
any A/D's or comparators and which forms the $A_i B_j$ products in a
single clock cycle. The factor limiting the speed of the device is the
electronics (≈ 200-300 MHz with MECL II logic) and not the optics
(> 3 GHz). A close look at the device, however, reveals that optics is
used  only  for  the  high-speed  interconnections  (between  data  source  and 
multiplier)  and  not  for  the  actual  product  computation  which  is 
performed  exclusively  by  dedicated  digital  electronics.  Thus,  the 
efficient  implementation  of  binary-valued  BPAM  in  conjunction  with 
binary  detection  results  in  a  system  where  the  computational  role  of 
optics  is  practically  zero.  We  discuss  such  a  device  in  more  detail  in 
Section 
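
The binary-valued scheme can be modelled in software to show how little arithmetic remains once the bit products are formed. The sketch below is a behavioural model under stated assumptions, not the circuit of Figure 6.1 or 6.2: each $A_i B_j$ product is an AND of two bits, one counter per convolution point i + j accumulates the ones over an M-element inner product, and a final shift-and-add applies the binary weights. All names are hypothetical.

```python
def to_bits(value, n_bits):
    """Least-significant-bit-first list of the n_bits binary digits of value."""
    return [(value >> i) & 1 for i in range(n_bits)]


def and_gate_inner_product(a_vec, b_vec, n_bits):
    """Inner product of two vectors of n_bits-wide unsigned integers.

    Bit products come from AND operations, counters accumulate the ones for
    each convolution point, and a shift-and-add applies the binary weights.
    """
    counters = [0] * (2 * n_bits - 1)  # one counter per convolution point
    for a, b in zip(a_vec, b_vec):
        a_bits, b_bits = to_bits(a, n_bits), to_bits(b, n_bits)
        for i in range(n_bits):
            for j in range(n_bits):
                counters[i + j] += a_bits[i] & b_bits[j]  # AND gate output
    # Shift-register/accumulator stage: weight counter k by 2**k and sum.
    return sum(count << k for k, count in enumerate(counters))


result = and_gate_inner_product([3, 2, 1], [1, 3, 2], n_bits=2)
print(result)  # 3*1 + 2*3 + 1*2 = 11
```

In this model the only operations are AND, counting, shifting and adding, which is consistent with the observation that the computation is carried out entirely by digital electronics.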

6.4  Performance Comparison Conclusions

In this Section we have examined the performance of the AO processors
from the points of view of system efficiency and multiplication speed.
It is found that DMAC-based AO systems do not compare favorably with
existing state-of-the-art electronic multipliers. BPAM-based AO
systems, although superior to DMAC-based systems, have a performance
which is about the same as that of existing GaAs devices. An attempt to
use BPAM systems with digital counters, instead of A/D's or comparators,
results in a system in which optics is used for the high-speed
interconnections but not for computations.


7.  OPTICALLY  ADDRESSED  ELECTRONIC  DIGITAL  MULTIPLIERS 


7.1  Introduction


In the previous sections we have examined techniques that allow
Acousto-Optic processors to perform digital-accuracy arithmetic
computations. We have shown that when the multiplication speed and
system efficiency figures are used as performance measures, the AO
systems have no significant advantage over existing state-of-the-art
digital electronic multipliers. However, one way in which optics can be
used to improve the performance of electronic arithmetic units is
through the natural ability of optics to perform interconnections.
There are several reasons for this and they have been documented in the
excellent paper by Goodman et al.: (1) optical interconnects (OI) allow
freedom from capacitive loading effects, thus allowing high speed signal
propagation, (2) OI offer immunity from signal interference effects,
which allows for massive 2-D interconnects, (3) OI do not have to be
planar (as opposed to electronic interconnects), (4) if open space OI
are used, then some reprogrammability can be achieved via "dynamic
interconnections" and (5) optical signals can be injected directly into
electronic logic devices.

The above reasons clearly show the advantages of OI over their
electronic counterparts. This chapter addresses the use of OI in
processors that can be applied to the APAR problem. In Section 7.2 we
briefly  discuss  the  concept  of  array  processors  for  matrix-matrix 
multiplication.  In  Section  7.3  we  address  some  practical  issues  which 
are  necessary  for  the  realisation  of  optically-interconnected  array 
processors.  Finally,  in  Section  7.4,  we  discuss  the  prototype 
optically-addressed ECL multiplier which we have fabricated.

7.2  Optically Interconnected Array Processor for Matrix Multiplication

A large number of algorithms (or parts of algorithms) used for the APAR
problem can be expressed in terms of matrix multiplications. Typical
examples can be found in some of the algorithms used for the eigensystem
solution, presented in Section 2, as well as in the Gram-Schmidt
technique presented in Section 9. Consider an example of matrix-matrix
multiplication. Let matrices A and B each of dimensions MxM be
represented by:

Another way of expressing C is through outer products $A_i B_i^T$; i.e.,
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A  simple  processor  that  implements  Equation  7.5  is  shown  in  Figure  7.1 
and consists of a square array of MxM multiplier-accumulator units
(MAU). In this system, the column data $A_i$ and row data $B_i^T$ are
broadcast instantly along the columns and rows of the array
processor. At each clock cycle a new outer product is formed and
added to the sum of the previous products. In this fashion
the total number of cycles needed for a full matrix-matrix
multiplication is M. This design is not suitable for VLSI circuit
design because it needs global communication. Thus, when VLSI
implementation is desired, one has to configure the processor of
Figure 7.1 in a systolic architecture so that only local
interconnections are used. In such a scenario, the number of cycles
needed for a full matrix-matrix multiplication depends on the specific
systolic implementation used. For a square array, similar to the one
of Figure 7.1, this number is of the order of 2M. Thus the "globally
interconnected" array of Figure 7.1 offers the advantage of improving
the processing speed by a factor of 2. Note that when high frequency
operation is desirable (≥ 500 MHz) the locally interconnected array
processor probably requires the use of OI for data transmission. This
is because at such frequencies the number, paths, lengths, and
terminations of microstrip and strip lines are extremely critical in
order to avoid effects such as capacitive loading, delays,
overshoot/undershoot, etc. (an extensive but simple treatment of this
issue can be found in Reference 16). Thus, since OI seem to be
necessary for high frequency operation, it is logical to use OI in a
global rather than local fashion.
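
The outer-product accumulation performed by the globally interconnected
array can be illustrated with a short sketch. This is only a software
model under stated assumptions: the function name and the nested-list
matrix layout are hypothetical, and each pass of the outer loop stands
for one clock cycle in which a column of A and a row of B are broadcast
to all MAUs.

```python
def outer_product_matmul(a, b):
    """Multiply two M x M matrices by accumulating M outer products."""
    m = len(a)
    c = [[0] * m for _ in range(m)]               # one accumulator per MAU
    for cycle in range(m):                        # one outer product per clock cycle
        column = [a[i][cycle] for i in range(m)]  # column `cycle` of A (broadcast)
        row = b[cycle]                            # row `cycle` of B (broadcast)
        for i in range(m):
            for j in range(m):
                c[i][j] += column[i] * row[j]     # MAU (i, j) multiplies and accumulates
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(outer_product_matmul(a, b))  # [[19, 22], [43, 50]]
```

The outer loop runs exactly M times, which corresponds to the M cycles
quoted above for a full matrix-matrix multiplication.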

7.3  Optical  Interconnects  for  Array  Processor 

In this section we analyse the OI for the array processor of
Figure 7.1. As we have described in the previous section, the column
data $A_i$ and row data $B_i^T$ need to be broadcast instantly across the
array. This implies that we need to address M different MAU inputs
with the same data. This in turn implies that we need to split the
data 1:M. Thus, we need a simple technique which uses
state-of-the-art technology and which allows us to drive M optical
channels from one optical source. Before we describe such a technique
it is of interest to discuss a technique for coupling the optical
source (laser diode) into a fiber.

A schematic drawing of a fiber-optic adapter which we have developed
for efficient coupling of the output power from a Mitsubishi ML4402
laser diode into a fiber of core ≥ 100 µm is shown in Figure 7.2. The
output beam from the laser is elliptical with beam divergences of 33°
and 11°, full angular spread at the half-power points, along the major
and  minor  axes,  respectively.  To  collect  this  angular  spread  directly 
into  a  fiber  without  the  use  of  lenses  requires  that  the  fiber  end  face 
be  positioned  much  closer  to  the  diode  emitting  surface  than  the 
windowed  laser  package  allows.  Accordingly,  we  have  removed  the  window 
and  protective  can  from  the  laser  to  allow  complete  access  to  the 
emitting  surface. 

For the fiber, we have chosen a glass fiber of 100 µm core diameter and
140 µm cladding diameter, since this represents a fiber having one of
the  highest  aspect  ratios  readily  available.  Although  this  factor  is 
not  critical  for  the  multiplier  application  (see  Section  7.4),  it  is  an 
important  parameter  when  the  outputs  from  many  fibers  are  combined  in  a 
fan-in  or  fan-out  as  is  the  case  of  the  array  processor  or  the  case  of 
the  application  to  look-up  tables  discussed  later  in  Section  8.3. 

Thus,  to  intercept  all  the  laser  output,  out  to  the  half-power  points, 
into a 100 µm core fiber requires that the emitting surface-fiber face
separation be less than 160 µm. The fiber-optic adapter shown in
Figure  7.2  accomplishes  this,  taking  into  account  the  manufacturing 
tolerance  levels  in  the  height  of  the  diode  surface  above  the  base  of 
the  laser. 


Figure  7.2  ML  4402  Laser  Diode  Pigtail  Adapter 


The  adapter  consists  of  a  plexiglass  rod  bored  at  one  end  to  be  a  slide 
fit to the diameter of the laser diode base. The diode is inserted
into the adapter until it bottoms on the machined step in the adapter
and is cemented to the adapter using UV-cured epoxy. In this way the
diode  emitting  surface  is  precisely  located  (within  diode  manufacturing 
tolerances)  with  respect  to  the  step  in  the  plexiglass  adapter.  The 
cemented  diode  sits  completely  within  the  adapter  to  allow  convenient 
attachment  of  the  insulating  adapter  to  a  circuit  board  without  danger 
of  shorting  the  diode  base  (which  forms  one  of  the  diode  connections) 
to  the  board.  The  other  end  of  the  adapter  is  machined  and  threaded  to 
accept  (in  this  case)  a  standard  Simplex  ferrule  connector.  The  length 
of  the  adapter  above  the  step  (which  locates  the  base  of  the  diode)  is 
such  that  a  ferrule,  polished  to  its  standard  length,  when  inserted 
into  the  axial  hole  in  the  adapter  is  positioned  on  axis  and  with  the 
face of the ferrule located within the distance of 160 µm required to
intercept the divergent output beam of the laser. This adapter has
proven to be a simple, efficient, and reproducible means of coupling
the output from the laser diode into a 100 µm-core fiber. All of the
hundred diode-adapter assemblies fabricated so far have given the full
maximum diode output of 5 mW measured at the end of the fiber for diode
currents ≥ 10 mA above the threshold current.

Splitting one optical channel into k optical channels can be achieved
via the use of: (1) holograms, (2) star couplers in conjunction with
fibers and (3) fiber-optic splitters that use resilient-ferrule
connectors. The first approach is practical in applications where the
k optical channels distribute their information in a relatively small
area; e.g., 100-400 cm². The second and third techniques allow signal
distribution without any practical restriction in area or path length.
Star couplers, however, are more expensive and have a higher loss (for
k ~ 40). For these reasons we have decided to use the resilient-
ferrule connector (RFC) approach.
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In  the  RFC  approach  (Figure  7.3)  the  optical  source,  usually  a  laser 
diode, is coupled to a relatively wide core diameter fiber (e.g., 420
µm). The output of the wide fiber produces a uniformly distributed
spot of light. We then pack a number of smaller diameter fibers into a
resilient-ferrule connector (Figure 7.3), e.g., 7 fibers of 140 µm
cladding and 100 µm core. Thus, we create an effective active area of
380 µm diameter which is covered by the cores of the fibers. Each
fiber  receives  nearly  the  same  amount  of  light.  The  other  end  of  each 
fiber  is  terminated  in  a  separate  connector  which  is  used  for  the  MAU 
connection.  Note  that  by  reversing  input/output  ends,  we  can  use  the 
RFC  arrangement  as  an  11:1  combiner.  This  is  exactly  the  way  we  use  it 
for  the  look-up  table  residue  approach  we  discuss  in  Section  8.3.  Note 
that  the  RFC  technique  has  losses  that  are  comparable  to  those  found  in 
connectors  (about  0.5  dB)  and  is  less  costly  since  it  eliminates  the 
expense  of  the  coupler  itself. 

For  efficient  coupling,  we  need  to  maximize  the  core-to-cladding  ratio 
(CCR)  of  the  small  fibers  as  well  as  the  total  effective  receiving  area 
of  the  fiber  bundle  in  the  RFC  (shaded  area  in  Figure  7.3).  The  former 
is  needed  in  order  to  maximize  the  effective  receiving  area  per  fiber. 
The  latter  is  necessary  in  order  to  minimize  the  amount  of  unused 
light.  Maximizing  the  CCR  implies  that  we  avoid,  if  possible,  the  use 
of  single-mode  fibers  which  have  a  very  small  CCR  of  the  order  of  0.04 
(e.g., 5 µm core and 125 µm cladding). Note, however, that single-mode
fibers are the only choice if multi-Gb/s data rates are needed. In a
multi-mode fiber, typical CCR's are of the order of 0.7 (e.g., 100 µm
core and 140 µm cladding). For these fibers, and at λ = 850 nm, the
typical transmission bandwidth is about 100-200 MHz-km. For our
application, maximum distances of the order of a meter are expected.
For these distances effective data rates of > 1 Gb/s can be easily
achieved. In fact, we have shown that for 2 meters of 50112 AMP fiber,
data rates of > 1.2 Gb/s can be achieved. Maximizing the total
receiving area is equivalent to efficient fiber packaging. One can
easily show that for efficient packaging, a symmetric fiber arrangement
similar to the one shown in Figure 7.3, is needed. In such an
arrangement, the total number of fibers $M_0$ and the overall diameter D
are given by:


\[ M_0 = 1 + 3k(k + 1) \]

\[ D = (2k + 1)\,d \]

where k = 1,2,3,... and d is the cladding diameter. In this case the
aspect ratio $A_0$ of the total area is

\[ A_0 = \frac{[1 + 3k(k + 1)]\,a}{(2k + 1)^2} \]

where $a = (d_1/d)^2$ is the aspect ratio of a single fiber of core
diameter $d_1$. In Table 7.1 we show, as a function of k, the total
number of fibers $M_0$, the array diameter D, the overall efficiency η,
and the core diameter $d_0$ of the laser coupling fiber. For these
calculations we assume that each fiber in the bundle has a 100 µm core
and a 140 µm cladding, i.e., a = 0.51. For demonstration purposes, we
have implemented the cases for k = 1 and 2 using 50112 AMP 100 µm/140
µm fiber. Typical output radiation patterns are shown in Figure 7.4.

In both cases we obtain coupling efficiency results that are in
excellent agreement with the figures of Table 7.1. Note that our
experimental results show that for M = 16 and 19 there is little
difference in η. Thus, for all practical calculations involving M = 16
one can use the M = 19 data. Also note that because of practical
reasons (availability of proper fibers, connector dimensions, etc.)
M = 19 is probably the upper limit of the RFC technique.
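
The packing relations above are easy to evaluate numerically. The
following sketch is illustrative only: the function name `rfc_bundle`
and its parameter names are hypothetical, and the overall efficiency is
assumed to equal the area aspect ratio of the bundle, i.e., connector
and reflection losses are ignored.

```python
def rfc_bundle(k, core_um=100.0, cladding_um=140.0):
    """Return (number of fibers, bundle diameter in um, area aspect ratio)."""
    fibers = 1 + 3 * k * (k + 1)                 # hexagonal packing with k rings
    diameter_um = (2 * k + 1) * cladding_um      # overall bundle diameter D
    a = (core_um / cladding_um) ** 2             # single-fiber aspect ratio
    efficiency = fibers * a / (2 * k + 1) ** 2   # fraction of the spot covered by cores
    return fibers, diameter_um, efficiency


for k in (1, 2):
    fibers, diameter_um, efficiency = rfc_bundle(k)
    print(f"k={k}: {fibers} fibers, D={diameter_um:.0f} um, efficiency={efficiency:.2f}")
```

For k = 2 this gives 19 fibers and an efficiency of about 0.39, the
value used in the power budget that follows.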


We now discuss some issues associated with the amount of optical power
necessary for the interconnections. Let us assume that we are dealing
with a 16 x 16 array processor which we intend to interconnect using
discrete detectors. For high data rates (e.g., several hundreds of
MHz) we need the detector to interface to an impedance of about 50
ohms. Thus, for a 1 V swing, the detector needs to provide 20 mA.
Current state-of-the-art high speed (rise time ≤ 0.5 nsec) pin diode
detectors (e.g., Motorola MFOD 1100) have a responsivity of 0.3 µA/µW.
Thus, we require about 67 mW of optical power incident on the detector.
Since M = 16, the individual fiber coupling efficiency is 0.39/16 =
0.024. Thus, the total optical power incident on the RFC needs to be
67/0.024 mW = 2.8 W. This power is beyond that obtainable from state-
of-the-art laser diodes so that buffers must be incorporated between
detectors and MAU inputs. A typical example of such a buffer unit is
the Advanced Micro Devices Am 6687 comparator which allows for data
rates of up to 300 MHz. This device requires a minimum input of 5 mV
which corresponds to a total laser power of 14 mW. State-of-the-art
low-cost laser diodes, such as the Hitachi HLP 1400, can deliver up to
20 mW at data rates in excess of 800 Mb/s. It is thus our conclusion
that if the discrete detector/buffer approach is used in conjunction
with existing, low cost technology, then the OI for a fully parallel 16
x 16 array processor with data rates in excess of 300 Mb/s per channel
can be built. If the electronic MAU's are capable of following the
above data rates, then the system MS will be 300 MHz and the total
throughput rate will exceed 16 x 16 x 300 x 10^6 M-A/s = 76 x 10^9 M-A/s
which is obviously a tremendous processing capability. Note that if
the detectors can be integrated with the MAU chip, then the effective
impedances will increase perhaps by an order of magnitude. In this
case the buffers are not needed and the OI that consists of existing,
low-cost laser diodes/RFC/detectors can deliver data rates of the order
of 1 Gb/s.
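
The power budget above can be checked with a few lines of arithmetic.
The sketch below is a rough model, not a design tool: the function and
parameter names are hypothetical, and it assumes that the comparator's
5 mV input is developed across the same 50 ohm load as in the
direct-drive case.

```python
def required_laser_power_mw(swing_v, load_ohm=50.0, responsivity_a_per_w=0.3,
                            bundle_efficiency=0.39, channels=16):
    """Total optical power (mW) the laser must deliver into the RFC."""
    current_a = swing_v / load_ohm                        # detector current for the swing
    detector_power_w = current_a / responsivity_a_per_w   # optical power at one detector
    per_fiber_efficiency = bundle_efficiency / channels   # share of the light per fiber
    return 1e3 * detector_power_w / per_fiber_efficiency


print(f"direct 1 V drive:   {required_laser_power_mw(1.0):.0f} mW")
print(f"5 mV comparator in: {required_laser_power_mw(0.005):.1f} mW")
print(f"array throughput:   {16 * 16 * 300e6:.2e} M-A/s")
```

Without intermediate rounding the direct-drive case comes out near
2.7 W (the text rounds the per-fiber efficiency to 0.024 and quotes
2.8 W), and the comparator case near 14 mW.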
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In  the  following  section  we  present  an  experimental,  optically 
addressed, 4x4 bit ECL MAU that uses components similar to the ones
we have been considering.

7.4  Prototype  Optically  Addressed  ECL  Multiplier 

To  illustrate  the  capability  of  optical  interconnections  we  have 
fabricated  a  4x4  bit  optically  addressed  multiplier  based  on  ECL  logic. 
The  complete  arrangement  is  shown  schematically  in  Figure  7.5. 

The  optical  data  generator  consists  of  eight  pulsed  laser  diodes 
(Mitsubishi  Type  ML  4402)  which  provide  the  two  4-bit  words.  These 
laser  diodes  are  driven  from  a  common  pulse-generator  source  (Hewlett 
Packard  Type  8082A)  which  is  fanned  out  to  eight  lines  each  of  which  is 
connected  to  the  transistor  drive  (Motorola  Type  2N5943)  of  each  laser 
diode.  Each  laser  diode  is  housed  in  a  fiber  optic  adapter  (described 
in  Section  7.3)  which  accepts  a  standard  Simplex  fiber  connector.  This 
optical  data  generator  may  be  driven  at  the  maximum  frequency  of  the 
pulse  generator  (250  MHz  for  a  50%  duty  cycle)  thereby  providing  the 
equivalent  of  a  500  MHz  binary  (0,1)  data  rate. 

The  optical  interconnect  consists  of  eight  fiber  optic  lines  fabricated 
from 100 µm core/140 µm cladding cable (AMP 50112). These lines are
provided  with  Simplex  connectors  at  either  end  for  coupling  the  laser 
diode  output  from  the  data  generator  board  to  the  optical  interface/ECL 
multiplier  board. 

A  schematic  diagram  of  the  circuit  used  for  the  optical  interface/ 
multiplier  is  shown  in  Figure  7.6.  The  optical  interface  is  provided 
by  pin  diode-comparator  combinations.  Each  optical  input  signal 
corresponding  to  a  single  binary  bit  is  fed  to  a  pin  diode  detector 
(Motorola  Type  MFOD  1100)  housed  in  a  Simplex  fiber  connector  mount  to 
accept the fiber-optic interconnection. The output from each pin diode
is in turn fed to a comparator (AMD Type AM 6687) which generates a
standard ECL logic pulse provided the amplitude of the output pulse
from the pin diode exceeds a preset threshold level.

Figure 7.5  Schematic Diagram of the Optically Addressed Multiplier


Insofar as the multiplier itself is concerned, there are several
architectures available for configuring electronic gates to perform
multiplication (Reference 17). For the present demonstration we have
chosen to use the Pezaris arrangement which utilizes full adders as
shown schematically in Figure 7.7. Thus, for two 4-bit binary numbers
A = $a_3 a_2 a_1 a_0$ and B = $b_3 b_2 b_1 b_0$, the product Z = A·B =
($a_3 a_2 a_1 a_0$)·($b_3 b_2 b_1 b_0$) may be written in the form


\[
\begin{array}{cccccccc}
    &        &        &        & a_3b_0 & a_2b_0 & a_1b_0 & a_0b_0 \\
    &        &        & a_3b_1 & a_2b_1 & a_1b_1 & a_0b_1 &        \\
    &        & a_3b_2 & a_2b_2 & a_1b_2 & a_0b_2 &        &        \\
    & a_3b_3 & a_2b_3 & a_1b_3 & a_0b_3 &        &        &        \\ \hline
Z_7 & Z_6    & Z_5    & Z_4    & Z_3    & Z_2    & Z_1    & Z_0
\end{array}
\]

corresponding to the practical implementation of Figure 7.7.

The building block of the multiplier is a 2x1 bit array multiplier from
the MECL Series (Motorola Type MC 10287) which is a dual package, each
half incorporating two-input AND gates for forming the binary bit
products, followed by a full adder for summing the products, with
internal carry lookahead for high speed operation. The logic diagram of
the array multiplier block is shown in Figure 7.8 and is particularly
suited for use in the Pezaris architecture. Thus, the data outputs from
the interface comparators are fed to the inputs of six MC 10287 packages,
the outputs of which provide the products $Z_1$ to $Z_7$. The least
significant bit product $Z_0 = a_0 b_0$ is provided by a single AND gate.
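
The principle of forming the product from AND gates and full adders can
be sketched in software. The model below is a generic unsigned 4x4
array multiplier written for illustration; it does not reproduce the
actual wiring of the Pezaris arrangement or of the MC 10287 blocks, and
all function names are hypothetical.

```python
def full_adder(x, y, carry_in):
    """One-bit full adder: returns (sum bit, carry out)."""
    total = x ^ y ^ carry_in
    carry_out = (x & y) | (carry_in & (x ^ y))
    return total, carry_out


def array_multiply_4x4(a, b):
    """Multiply two unsigned 4-bit words using only AND gates and full adders."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> j) & 1 for j in range(4)]
    z = [0] * 8                                   # product bits Z0..Z7
    for j in range(4):                            # one row of partial products per bit of B
        carry = 0
        for i in range(4):
            partial = a_bits[i] & b_bits[j]       # AND gate forms a_i * b_j
            z[i + j], carry = full_adder(z[i + j], partial, carry)
        z[j + 4] = carry                          # carry out of the row
    return sum(bit << n for n, bit in enumerate(z))


print(array_multiply_4x4(15, 15))  # 225
```

Checking all 256 input combinations against ordinary integer
multiplication is a convenient way to validate such a model.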


of the Pezaris Multiplier


Figure  7.8  Logic  Diagram  of  the  MC  10287  Array  Multiplier  Block 


A  strobed  latch/LED  display  provides  visual  readout  of  the  product  from 
the  multiplier.  Each  of  the  product  bits  from  the  multiplier  is  fed  to 
a  latch  (Motorola  Type  10175)  which  is  strobed  by  a  suitably  delayed 
pulse  derived  from  the  same  pulse  generator  used  to  produce  the  optical 
data.  Thus,  the  latches  are  strobed  synchronously  with  the  data,  each 
output  pulse  from  the  latches  driving  the  specific  LED  associated  with 
each  product  output  bit. 

Layout  and  fabrication  of  the  double-sided  boards  was  carried  out  on  a 
CAD/CAM  facility.  In  the  board  layout,  particular  attention  was  paid  to 
the  design  to  ensure  high-speed  operation.  A  photograph  of  the 
assembled  boards  is  shown  in  Figure  7.9. 

For  the  purpose  of  exercising  the  multiplier,  the  input  words  A  and  B 
are  varied  by  connecting  the  appropriate  fiber-optic  lines  between  the 
data-generator  and  multiplier  boards.  A  composite  record  showing  the 
input/output  pulse  responses  from  the  multiplier  board  is  shown  in 
Figure  7.10(a).  The  upper  trace  on  this  record  corresponds  to  the  input 
data-bit  pulse  from  the  interface  comparator  (all  eight  pulses  are 
coincident  in  time)  followed,  in  sequence,  by  the  eight  output  bit 
pulses corresponding to the products $Z_0$ to $Z_7$, respectively. This
record has been obtained by varying the input word A keeping the full
input word B (i.e., $b_3 = b_2 = b_1 = b_0 = 1$) fixed. From Figure 7.10(a)
it may be seen that the delay time between the input data pulse and the
output product pulse increases from $Z_0$ through $Z_6$ with the most
significant bit $Z_7$ (the final carry bit) delayed somewhat less than $Z_6$.
The various delays measured from Figure 7.10(a) are compared in
Table 7.2 with those expected from the multiplier architecture used.
For the latter, we have characterized the delays in terms of the unit
gate delay time Δ. Thus, the propagation delay times of the AND, OR,
and XOR gates incorporated in the multiplier blocks are 2Δ, 2Δ and 3Δ,
respectively. Taking a value of Δ = 0.3 ns (corresponding to a
propagation delay of
an ECL NAND gate), we obtain the values shown in the last column of
Table 7.2. These are in good agreement with the measured values. The
total multiplication time is dependent on the value of the input words.
The maximum time is evidently that for which the product includes the
$2^6$ ($Z_6$) bit. For the present arrangement this propagation delay is
measured to be 11 ns which is in reasonable agreement with the
manufacturer's specification of 14 ns.

Figure 7.10  (a) Composite oscilloscope record showing in sequence from
             top to bottom, the relative delays between the input
             data bit pulse, and the various output bit pulses
             corresponding to the products $Z_0$ to $Z_7$, respectively.
             (b) Composite oscilloscope record showing in sequence from
             top to bottom, the input data pulse and output bit
             pulses corresponding to $Z_0$ (minimum delay) and $Z_6$
             (maximum delay) for maximum operating speed of the
             multiplier (220 MHz).

Table 7.2  Input to $Z_i$ Propagation Delay Times (columns: Estimated, in
           units of Δ; Estimated, in ns for Δ = 0.3 ns; Measured, in ns)

In  characterizing  the  performance  of  multipliers  we  wish  to  distinguish 
between  the  total  multiplication  time  and  the  throughput  rate.  In  the 
present  arrangement,  the  maximum  throughput  rate  is  governed  by  the 
requirement  that  the  data  bit  pulse  be  sufficiently  wide  that  all  of 
the  product  bit  pulses  be  present  during  the  strobe  pulse.  Thus,  the 
minimum pulse width is given by the maximum difference in propagation
delay among the product bit pulses, i.e., the difference in propagation
delay for $Z_0$ and $Z_6$. From Table 7.2 this is measured to be 9 ns.
Thus, the present multiplier has a maximum data throughput rate of 110
MHz. However, we note that if a truly pipelined architecture is used
the  throughput  rate  may  be  significantly  increased.  While  we  are  not 
able  to  do  anything  in  the  present  demonstration  at  the  chip  level  we 
note  that  we  can  exercise  the  board  at  a  higher  data  throughput  rate 
than  is  dictated  by  the  strobe  requirements.  An  example  of  this  is 
shown  in  Figure  7.10(b)  which  shows  from  top  to  bottom,  the  input  data 
pulse, the $Z_0$ output bit pulse, and the $Z_6$ output bit pulse,
respectively,  where  the  input  data  pulse  width  has  been  minimized  while 
still  maintaining  all  output  bit  pulses.  From  Figure  7.10(b)  it  may  be 
seen  that  a  data  throughput  rate  of  220  MHz  is  possible  if  suitable 
delays  are  introduced  in  the  circuit  so  that  the  output  bits  can  be 
strobed  simultaneously.  This  throughput  rate  is  limited  both  by  the 
comparator  delay  time  in  the  optical  interface  and  by  propagation 
delays  in  the  multiplier  chips  themselves. 
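
The relation between delay spread and strobed throughput used above can
be written out as a small sketch. The function name is hypothetical,
and the 2 ns value for the fastest bit is an assumption inferred from
the quoted 11 ns maximum delay and 9 ns spread; it is not a separately
reported measurement.

```python
def max_strobed_rate_mhz(delays_ns):
    """Highest data rate (MHz) at which all output bits still overlap one strobe."""
    skew_ns = max(delays_ns) - min(delays_ns)  # spread between fastest and slowest bit
    return 1e3 / skew_ns                       # 1 / (skew in ns), expressed in MHz


# Only the extremes matter: assumed 2 ns (fastest bit) and 11 ns (slowest bit).
print(round(max_strobed_rate_mhz([2.0, 11.0])))  # 111
```

The result, about 111 MHz, is the figure rounded to 110 MHz in the
text. Equalizing the output delays reduces the skew and hence raises
the usable rate.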

In conclusion, we see that existing fiber-optic and electronic
technology can be used in order to fabricate a fully parallel optically
interconnected square array processor for matrix-matrix multiplication.
With this existing technology we expect multiplication speeds that
exceed 200 MHz (assuming a pipelined architecture). Such an array
processor  is  obviously  easier  to  implement  than  the  Acousto-Optic 
processors  we  have  presented  in  Section  4.  This  further  demonstrates 
that  Acousto-Optic  systems  cannot  compete  with  existing  electronic 
technology. 


8.  RESIDUE  LOOK-UP  TABLE  ELECTRO-OPTIC  PROCESSING 


8.1  Introduction 

In  this  section  we  discuss  a  residue  arithmetic  approach  for  high-speed 
Electro-Optic  processing.  As  we  have  shown  in  Section  5,  Acousto-Optic 
binary  processors  cannot  compete  efficiently  with  existing  digital 
electronic  counterparts.  One  of  the  reasons  is  that  Acousto-Optics 
performs  only  part  of  the  operation,  i.e.,  the  convolution,  and 
power-hungry  electronics  is  needed  in  order  to  convert  the  mixed-binary 
data  of  the  convolution  into  a  conventional  binary  form.  If  operation 
in  the  mixed-binary  form  would  be  possible  for  many  processing  steps, 
then  the  system  efficiency  would  improve  because  of  the  fewer 
conversions  that  would  be  necessary.  Unfortunately,  this  is  not  the 
case.  Thus  we  have  decided  to  explore  other  arithmetic  schemes  which 
allow  many  operations  to  be  performed  before  conversion  into  a  more 
conventional  arithmetic  is  needed.  One  such  possibility  is  residue 
arithmetic. 

In  the  following  Section  8.2  we  present  a  brief  discussion  of  the  basics 
of  residue  arithmetic.  In  Section  8.3  we  describe  a  possible  look-up 
table  (LUT)  technique  for  high  speed  processing  and  in  Section  8.4  we 
discuss  our  prototype  LUT.  In  Sections  8.5  and  8.6,  we  show  how  one  can 
convert  from  binary-to-residue  and  from  residue-to-binary,  via 
utilization  of  LUT  techniques.  In  Section  8.7,  we  discuss  issues 
associated  with  hardware  minimization.  Finally,  in  Section  8.8  we 
discuss  an  example  of  residue  LUT  processing,  a  square  array  processor 
for  matrix-matrix  multiplication. 
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8.2  Residue  Arithmetic  Basics 


Residue arithmetic, because of lack of carries, is probably the fastest
way of performing addition, subtraction, multiplication and various
polynomial transformations. The residue number system (RNS) is based
upon N fixed pairwise relatively prime integers $m_1, m_2, \ldots, m_N$
which are called moduli (or base). An integer number X, that lies in
the range 0 to (M-1) (M is the product of the N moduli) is uniquely
represented with respect to the N moduli via the N-tuple of residues
($R_{m_1}, R_{m_2}, \ldots, R_{m_N}$). Each residue $R_{m_i}$ is defined
to be the least non-negative integer remainder of the division of X by
$m_i$. For example, for the 5 moduli 7, 9, 11, 13, 16 the maximum range
M is equal to

\[ M = 7 \times 9 \times 11 \times 13 \times 16 = 144{,}144 \]

Thus, this set of moduli allows us to represent any integer in the range
0-144,143. For example, 279 is represented by (6,0,4,6,7). Note that
it is convenient to have an even modulo so that we can detect negative
numbers easily. In this case, the range 0 to (M/2 - 1) is used to
represent positive numbers, whereas the range M-1 to M/2 is used to
represent negative numbers. Note that in the latter case M must be
subtracted in order to obtain the correct answer.
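
The representation described above can be tried out with a short
sketch. The function names are hypothetical, and the decoding step uses
the Chinese Remainder Theorem purely as a software illustration; it is
an assumption of this sketch and not the look-up table conversion
discussed later in this section.

```python
from math import prod

MODULI = (7, 9, 11, 13, 16)
M = prod(MODULI)  # 144144


def to_rns(x, moduli=MODULI):
    """Residues of x; Python's % is non-negative, so negative x also works."""
    return tuple(x % m for m in moduli)


def from_rns(residues, moduli=MODULI, signed=False):
    """Recover the integer from its residues (Chinese Remainder Theorem)."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m                    # product of all the other moduli
        x += r * partial * pow(partial, -1, m)  # pow(..., -1, m) is the inverse mod m
    x %= total
    if signed and x >= total // 2:              # upper half of the range is negative
        x -= total
    return x


print(to_rns(279))                            # (6, 0, 4, 6, 7)
print(from_rns(to_rns(-31), signed=True))     # -31
```

Note that `pow` with a negative exponent and a modulus requires
Python 3.8 or later.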
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To  perform  residue  arithmetic  operations,  we  first  convert  all  the 

numbers of interest into RNS. We then perform the arithmetic operation
by operating on their RNS representations. The specifics of the RNS
operation depend on the specific arithmetic operation we are
performing. In all cases, however, operations over different moduli are
independent, there are no carries and the result of the operation on
modulo $m_i$ cannot exceed $m_i - 1$.
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To  add  in  RNS  we  simply  add  the  corresponding  residues  and  then  we  find 
the residues with respect to each modulo. For example, in base
(7,9,11,13,16) the sum 279 + 31 = 310 (31 in RNS is (3,4,9,5,15)) is

      (6,0,4,6,7)
    + (3,4,9,5,15)
    = (2,4,2,11,6) = 310.

To subtract in RNS we replace each residue digit of the subtrahend by its
complement and then perform an addition. In RNS the complement of a
residue is its difference from the modulo; e.g., in base 13 the
complement of residue 9 is 4. For example in base (7,9,11,13,16) the
difference 279-31 = 248 in RNS is

      (6,0,4,6,7)
    + (4,5,2,8,1)
    = (3,5,6,1,8) = 248.


To multiply in RNS we multiply the residues at each modulo and then find
the resulting residue. For example, 279 x 31 = 8,649 is

      (6, 0, 4, 6,  7)
    x (3, 4, 9, 5, 15)
    ------------------
      (4, 0, 3, 4,  9)  =  8,649.

Division in RNS is not always possible. RNS by definition represents
integers. The quotient of two integers is not always an integer and
thus it cannot always be represented in RNS. However, in some special
cases, division can be performed by means of multiplicative inverses. An
integer Y is called the multiplicative inverse of X if the product YX
with respect to modulo m is 1. For example, in modulo 11 the
multiplicative inverse of 5 is 9 (5 x 9 = 45, and 45 modulo 11 is 1).
To use multiplicative inverses for division, the result of the division
must be an integer and the divisor must share no common factor with any
of the moduli.
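
The four operations described above can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, and the modular inverse is obtained with Python's built-in three-argument `pow` (Python 3.8 or later), which is a software stand-in for whatever inverse table a hardware implementation would use.

```python
MODULI = (7, 9, 11, 13, 16)

def to_rns(x):
    """Residues of x with respect to each modulus."""
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    """Digit-wise addition; no carries pass between moduli."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def rns_complement(a):
    """Complement of each residue (its difference from the modulus)."""
    return tuple((m - x) % m for x, m in zip(a, MODULI))

def rns_sub(a, b):
    """Subtraction as addition of the complemented subtrahend."""
    return rns_add(a, rns_complement(b))

def rns_mul(a, b):
    """Digit-wise multiplication."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def rns_div(a, b):
    """Exact division via multiplicative inverses.

    Valid only if the true quotient is an integer and every residue of the
    divisor is invertible; pow() raises ValueError otherwise.
    """
    inverses = tuple(pow(y, -1, m) for y, m in zip(b, MODULI))
    return rns_mul(a, inverses)

a, b = to_rns(279), to_rns(31)
print(rns_add(a, b))               # (2, 4, 2, 11, 6)  -> 310
print(rns_sub(a, b))               # (3, 5, 6, 1, 8)   -> 248
print(rns_mul(a, b))               # (4, 0, 3, 4, 9)   -> 8,649
print(rns_div(to_rns(8649), b))    # (6, 0, 4, 6, 7)   -> 279
print(pow(5, -1, 11))              # 9
```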

8.3  Residue  Look-Up  Table  Processing 

Because of the lack of carries, residue arithmetic allows independent
calculations per modulo without the need for different modulo processors
to cross-communicate. A general RNS processing scenario is shown in
Figure 8.1. In such a general scenario, binary data are fed into a
binary-to-residue converter (B/R). The converter feeds its outputs (N
outputs for N-moduli operation) to N different processors. All
processors perform exactly the same RNS function(s) but with respect to
a different modulo m_i. Upon completion of the operation, the outputs
from the N processors are fed into a residue-to-binary converter (R/B),
whose output is the result expressed in a conventional binary form.


Because of this inherent lack of carry propagation, high speed
processing can be achieved. In addition, all processors are similar and
no global interconnections are needed (a very important consideration in
a VLSI configuration). Since the system dynamic range is the moduli
product, high dynamic range is achievable by using more parallel
channels. Thus, high dynamic range can be achieved without reduction of
the processing speed.

Let us now concentrate on the RNS processing itself (B/R and R/B
conversion are discussed in detail in Sections 8.5 and 8.6) under the
assumption that residue representations are available. Because of the
lack of carries, the bounded output range, etc., in RNS, various
arithmetic operations may be implemented via the use of the look-up
table (LUT). The idea is illustrated in Figure 8.2 where the LUTs for
multiplication and addition in modulo 5 are shown. For the
multiplication LUT, the objective is to create the product of numbers X
and Y without performing an actual multiplication. For modulo m_i, the
LUT has two sets of inputs (one for X and one for Y), each of which
consists of m_i inputs (X or Y can take only m_i different values). For
different values of the inputs, we obtain a different value at the
output. These outputs are pre-calculated and stored, and are read out
upon interrogation of the LUT with the inputs.

Various forms of LUT implementation have been proposed and analyzed in
the literature in the last decade. Among the architectures suggested
are those of Huang et al., Tai et al. and Polky et al., which are
more or less designed around electro-optic control of beams propagating
in integrated optical waveguide structures. Another approach is the one
by Gaylord et al. [23], which is based on binary coded residue LUTs. Such a
processor exploits the multiple parallel channel processing capability
that is inherent in optical systems. It performs EXCLUSIVE OR and NAND
logic operations through the use of optical LUTs, which are based on
holographic recordings or the use of spatial light modulators.

Figure 8.2 Look-up Tables for Multiplication and Addition in Modulo 5
Residue Arithmetic: (a) multiplication, (b) addition


Our  approach  for  implementing  residue  LUTs  is  based  on  the  utilization 
of  small,  high-speed  light  emitting  diodes  (LEDs)  or  laser  diodes  (LDs) 
in  conjunction  with  fiber-optic  combiners  or  holograms.  Numerical 
operations  are  performed  simply  by  generating  a  light  pulse  which 
reaches  a  detector  that  has  been  encoded  for  the  number  resulting  from 
each  operation.  Thus,  for  the  modulo  5  multiplication  (Figure  8.2) 
light  produced  at  the  intersection  of  inputs  3  and  2  drives  the  detector 
labelled  1.  Similarly,  for  the  modulo  5  addition,  the  light  generated 
at  the  intersection  of  3  and  2  illuminates  a  detector  encoded  0. 
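
The look-up idea can be mimicked in software. The sketch below is an illustration under stated assumptions, not the optical implementation: `build_lut` and `detector_fan_in` are hypothetical names. It precomputes the modulo 5 tables and lists, for each detector, the grid intersections whose light would have to be routed to it.

```python
def build_lut(modulus, op):
    """Precompute op(x, y) mod modulus for every pair of input residues."""
    return [[op(x, y) % modulus for y in range(modulus)]
            for x in range(modulus)]

def detector_fan_in(lut):
    """Map each output digit to the (x, y) intersections that must reach it."""
    fan_in = {d: [] for d in range(len(lut))}
    for x, row in enumerate(lut):
        for y, digit in enumerate(row):
            fan_in[digit].append((x, y))
    return fan_in

MUL5 = build_lut(5, lambda x, y: x * y)
ADD5 = build_lut(5, lambda x, y: x + y)

print(MUL5[3][2])                  # 1: inputs 3 and 2 drive detector 1
print(ADD5[3][2])                  # 0: inputs 3 and 2 drive detector 0
print(detector_fan_in(ADD5)[0])    # [(0, 0), (1, 4), (2, 3), (3, 2), (4, 1)]
```

In the addition table every detector collects exactly m intersections, whereas in the multiplication table detector 0 collects 2m - 1 of them when m is prime (nine for modulo 5).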

One way of implementing this concept is through an interlaced
two-dimensional  grid  of  electrodes  in  conjunction  with  high-speed  LEDs 
or  LDs  at  the  intersection  points  (Figure  8.3).  A  voltage  pulse  applied 
to  each  input  line,  such  that  the  pulse  amplitude  is  less  than  the  LED 
junction  voltage  but  that  twice  the  pulse  amplitude  exceeds  it  by  a 
considerable  margin,  causes  the  diode  at  the  intersection  point  to  emit 
strongly.  The  emitted  light  is  transmitted  to  a  detector  that  is 
encoded  for  the  number  to  be  produced  at  that  table  location,  as 
indicated  by  the  number  in  each  grid  box.  To  minimize  the  number  of 
detectors  required  and  to  promote  flexibility  in  LUT  geometry,  we  use 
fibers  (or  a  hologram)  to  transmit  light  from  each  diode  that 
corresponds  to  a  given  digit  to  the  single  detector  encoded  for  that 
digit. 

Other  arithmetical  operations  use  the  same  LUTs  in  combination  with 
one-sided  subprocessors  (or  wiring  maps)  which  precondition  some  of  the 
inputs  to  these  LUTs.  Thus,  subtraction  proceeds  through  formation  of 
the  additive  inverse  of  the  subtrahend  followed  by  look-up  of  the 
difference  in  the  addition  table,  while  division  proceeds  via  the 
multiplication  table  after  first  forming  the  multiplicative  inverse  of 
the  divisor.  The  wiring  maps  which  form  these  inverses  in  modulo  5  are 
shown in Figure 8.4. Additional operations such as raising to an
integral power or an inverse integral power are also performed using
wiring maps which do not require generation and detection of light
pulses. This shows the increased flexibility of the RNS LUTs as
compared with DMAC and BPAM. The inputs and outputs of the LUT remain
in the same residue base and thus different LUTs can be easily
interconnected without the need for conversion. Such capabilities
allow fast data flow and ease of pipelining as we show in Section 9.

Figure 8.4 Look-up Tables (Wiring Maps) for Additive and
Multiplicative Inverses in Modulo 5 Arithmetic
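
A wiring map is simply a fixed one-input permutation of residues, so it can be modelled as a dictionary. The following Python sketch (hypothetical function names; the modulo 5 tables are rebuilt here so that the example is self-contained) shows subtraction and division reusing the addition and multiplication LUTs after the map has preconditioned one input.

```python
from math import gcd

def additive_inverse_map(m):
    """Wiring map x -> (-x) mod m."""
    return {x: (-x) % m for x in range(m)}

def multiplicative_inverse_map(m):
    """Wiring map x -> x^(-1) mod m; residues without an inverse are omitted."""
    return {x: pow(x, -1, m) for x in range(1, m) if gcd(x, m) == 1}

M5 = 5
ADD5 = [[(x + y) % M5 for y in range(M5)] for x in range(M5)]
MUL5 = [[(x * y) % M5 for y in range(M5)] for x in range(M5)]
NEG5 = additive_inverse_map(M5)
INV5 = multiplicative_inverse_map(M5)

print(NEG5)                 # {0: 0, 1: 4, 2: 3, 3: 2, 4: 1}
print(INV5)                 # {1: 1, 2: 3, 3: 2, 4: 4}
print(ADD5[3][NEG5[4]])     # 4, i.e. (3 - 4) mod 5
print(MUL5[3][INV5[2]])     # 4, i.e. 3 / 2 mod 5, since 4 * 2 = 8 = 3 mod 5
```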

We now estimate the expected performance of the LUTs. MS figures of
the order of 1 GHz are to be expected since laser diodes have already
achieved sub-nsec switching times [24] and operation of LEDs at
frequencies > 1 GHz has also been reported. To estimate the power
consumption we must consider a specific computation example such as the
multiplication of two 8-bit numbers. The possible product range
(6.5 x 10^4) can be covered with the moduli 3, 5, 7, 8, 11 and 13.
Assuming 1 GHz operation, a signal/noise ratio of ~ 30 (corresponding
to a negligible bit error probability) and the use of detectors with
NEP of ~ 10^-13 W/√Hz, we find that the optical power incident on the
detector is about 0.1 µW. Furthermore, assuming ~ 10 dB fiber coupling
losses and 1% diode conversion efficiency we find that the total
electrical power per operating diode is ~ 100 µW. If 10% of this power
is used for diode prebiasing then the total prebias power consumption
(for the 437 diodes of the six LUTs) is about 4.4 mW. Adding to this
figure the power required to turn on the proper diodes
(2 x 47 x 100 µW ≈ 10 mW) we find that the total power consumption of
the diodes is ≈ 14.4 mW. Next, we calculate the power consumption
necessary to drive a LUT from another LUT, which is usually accomplished
via the use of buffers. This is an important issue which needs to be
investigated in detail but preliminary results suggest that only one
buffer unit per LUT (the one connected to the "on" detector) need be on
at any time. Assuming this to be the case and that the buffer unit is
only 1% efficient, we find that the total buffer power consumption is
6 x 100 x 100 µW = 60 mW. Thus, the total power consumption is 75 mW
which, when associated with a MS of 1 GHz, corresponds to a SE of
≈ 1.3 x 10^10 M-A/sec·W.
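
The arithmetic behind this estimate can be reproduced with the short Python sketch below. It starts from the 0.1 µW incident optical power quoted in the text and uses only the figures stated there; the variable names are hypothetical. Because no intermediate value is rounded, the totals come out slightly below the rounded figures in the paragraph.

```python
MODULI = (3, 5, 7, 8, 11, 13)           # one LUT per modulus
RATE_HZ = 1e9                           # assumed 1 GHz operation

incident_w = 0.1e-6                     # optical power at the detector
emitted_w = incident_w * 10 ** (10 / 10)   # undo ~10 dB coupling loss
diode_w = emitted_w / 0.01              # 1% conversion efficiency -> ~100 uW

n_diodes = sum(m * m for m in MODULI)   # 437 diodes in six m x m LUTs
prebias_w = 0.10 * diode_w * n_diodes   # 10% prebias on every diode
turn_on_w = 2 * sum(MODULI) * diode_w   # the report's 2 x 47 x 100 uW term
buffer_w = len(MODULI) * diode_w / 0.01 # one 1%-efficient buffer per LUT

total_w = prebias_w + turn_on_w + buffer_w
print(n_diodes)                         # 437
print(round(total_w * 1e3, 1))          # about 73.8 (mW)
print(RATE_HZ / total_w)                # about 1.36e10 operations per second per watt
```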

This preliminary analysis suggests that the LUT structures considered
here may offer MS and SE advantages of about an order of magnitude over
those of DMAC, BPAM and GaAs processing units.

8.4  LUT  Experimental  Results 


To demonstrate as well as further understand the LUT concept we have
fabricated a modulo 7 LUT which can be configured either as an adder or
as a multiplier by changing the fiber-optic connections (Figure 8.5).
The laser diodes used, Mitsubishi ML 4402, are capable of providing
5 mW pulses of light output at 780 nm. Their threshold current is
between 35 mA and 40 mA and the operating current is about 50 mA. The
LUT format is a square matrix of 7x7 LDs arranged in 7 rows and 7
columns as shown in Figure 8.6. A double-sided FR4 board is used which
allows the implementation of a "non-additive" scheme in which rows are
connected with common anodes and columns are connected with common
cathodes. In an alternative scheme the cathodes (anodes) are grounded
and the anodes (cathodes) are connected to both row and column lines.
Note that in this "additive" scheme we need to decouple row and column
lines (via the use of diodes) in order to avoid spread of the drive
current pulse. For this reason and because of the more complex
interconnection patterns of this scheme, we have decided to use the
"non-additive" scheme.

To ensure high-speed operation we have decided to use ECL compatible
drivers. Current to each anode row is supplied through the 35 Ohm
resistors R1-R7 which are connected to +5 V. This row current is
diverted from the laser diodes by transistors Q1-Q7. When the row
inputs to these transistors are at ECL logic "1" (≈ +1.2 V) the
transistor diverts about 55 mA from each row. Each cathode column is
connected to ground through a transistor, Q8-Q14, which can sink about
70 mA when an ECL logic "1" is present at the input, and through a
parallel 510 Ohm resistor which provides the path for initial bias
current to the diodes. When all lasers are "off", i.e., rows at logic
"1" and columns at logic "0", current through row resistors R1-R7 is
about 67 mA. The remaining row current (~ 12 mA) not diverted by Q1-Q7
flows through that row of lasers via the column resistors. This
initial current rises to ~ 25 mA when a row transistor is turned off.

It should be noted that the sum of all row currents not diverted by
input transistors and not flowing in the column resistors must be
sunk by column transistors, Q8-Q14, when that column is addressed.

The output from the laser diodes is coupled into RFC fiber-optic 7:1
combiners (see Figure 8.5) described previously in Section 7.3. The
combiner output drives a high speed (rise/fall times ≤ 1 ns) pin diode
(Motorola MFOD 1100) which in turn drives a high speed comparator
(AM6687). The comparator's dual outputs (positive and negative) are
capable of driving any of the transistors Q1-Q14. In this scenario, we
simulate a LUT that is capable of driving another LUT.

An important factor for high speed operation is the operating mode of
the driver transistors, which need to provide both high switching
speed and adequate current levels. Initially, Motorola 2N5943
transistors were used in a switching mode which offers the advantage of
large current flow. In this operating mode, a drive pulse applied to
the base quickly saturates the transistor and thus adequate current
flow is achieved. The problem associated with this approach is the
slow turn-off times due to the relatively long storage times. Use of
Schottky diodes between base and collector reduces the turn-off times
but not enough to allow GHz-type operation. With this technique we are
able to demonstrate about 110 MHz rates. An alternative approach is to
use the transistors in a current mode which virtually eliminates
the problem of slow turn-off times. Note, however, that this technique
allows relatively small amounts of current flow and is, thus, suited to
relatively low-current situations. Nevertheless, we have implemented
this technique using Motorola MRF 581 transistors which have an f_T of
4 GHz at 100 mA collector current levels. These transistors allow
operating current levels of 45-50 mA and about 2 mA of pre-bias current
which is shared among 7 laser diodes.

Our  first  experiment  deals  with  the  effect  of  the  board’s  micro-strip 
lines  on  switching  speed.  For  this  purpose  only  two  laser  diodes  are 
connected at positions (1,1) and (7,7). These positions are subject to
the shortest and longest propagation delays, respectively. The top
trace  of  Figure  8.7  shows  the  250  MHz  RZ  ECL  waveform  used  to  drive  the 
laser  diodes.  This  is  derived  from  a  Hewlett-Packard  (Type  8082A) 
pulse  generator.  The  second  trace  of  Figure  8.7  shows  the  (7,7)  laser 
diode  response  to  that  waveform  (detected  through  an  MFOD  1100  pin 
diode)  when  the  laser  cathode  is  pulsed  and  the  anode  is  DC  biased. 
Similarly,  the  third  trace  shows  the  response  when  the  anode  is  pulsed 
and  the  cathode  is  biased.  Finally,  the  fourth  trace  shows  the 
response  when  both  anode  and  cathode  are  pulsed  simultaneously.  As 
these  data  show,  the  laser  diode  responds  to  at  least  250  MHz  RZ  or 
500  MHz  NRZ  data  rates.  (These  frequency  limits  are  dictated  by  the 
available  pulse  generator).  Note  the  reduction  in  pulse  width,  by 
about  a  factor  of  2,  when  the  anode  is  partially  or  fully  driven.  This 
behavior  is  not  well  understood  and  is  believed  to  be  associated  with 
the  specific  laser  diodes  used.  In  any  event,  this  effect  is  not 
believed to seriously affect the LUT's performance. Note that the
(1,1) laser diode shows behavior similar to that of the (7,7) laser
diode, except that it lacks the small delay (≤ 1 ns). These results
show that both the impedance and inductance of the board's micro-strip
lines allow for at least moderately high switching speeds.


Figure  8.7  Mitsubishi  ML4402  laser  diode  switching  characteristics 

For the next experiment, the board is completely populated with the 49
laser diodes and different rows and columns are excited. Figures 8.8b,
8.8c and 8.8d show the output intensity spots in the absence of the
fiber-optic connections (LUT geometry is shown in Figure 8.8a) when the
(1,1), (4,5) and (7,7) rows and columns are excited. These figures
clearly demonstrate the concept of the interlaced electrode LUT and
show that only the cross-point laser diodes are in a lasing mode (i.e.,
strong intensity) whereas the remaining laser diodes are more-or-less
in a LED, sub-threshold, mode (i.e., low intensity). Figures 8.9a and
8.9b show the responses of the 7 laser diodes of the first row and the
7 laser diodes of the first column respectively, when driven with a
250 kHz RZ waveform. As these data show, all lasers have a clean
response to at least 250 kHz RZ. Note that there is about 20%
variation  in  the  output  light  level  because  of  the  variation  in 
threshold  level  and  current-power  characteristics  of  the  different 
laser  diodes.  In  Figures  8.10a  and  8.10b  we  show  the  responses  of  the 
7  laser  diodes  of  the  sixth  row  and  the  7  laser  diodes  of  the  sixth 
column.  Once  again  we  can  see  that  all  diodes  respond  to  the  drive 
waveform  of  250  kHz  RZ.  Note,  however,  that  the  response  is  not  as 
clean  as  that  of  Figure  8.9.  Specifically,  there  is  ringing  and 
undershoot  that  increases  with  distance  along  the  strip  line.  This 
problem  is  not  well  understood  but  is  believed  to  be  at  least  partially 
due  to  drive  pulse  reflections  caused  by  imperfect  termination  of  the 
strip  line.  This  is  a  difficult  problem  to  model  and  solve  because  of 
the  dynamic  impedances  of  the  laser  diodes  involved  and  is  compounded 
by  the  fact  that  we  are  dealing  with  laser  diodes  of  different 
characteristics.  It  is  important  that  this  issue  be  further  studied  in 
detail  to  provide  a  solution  which  will  eliminate  the  possibility  of 
false  responses  by  ringing. 

Figure 8.9 Responses of the 7 laser diodes of the first row (a) and of
the 7 laser diodes of the first column (b).

Finally, we have measured the "noise" due to the response of the laser
diodes which are connected to the rows and columns being exercised, but
which are not located at the cross-points. This is an important
measurement because it provides information concerning the detector
threshold level necessary in order to avoid false responses.

Figure 8.11 shows the responses of the 7 laser diodes of column 4 when
the (4,3) laser diode is being exercised. As Figure 8.11 suggests, the
response is rather small, about 20 dB below the response of the (4,3)
laser diode, which translates to rather non-critical thresholds.
Similar results are obtained when different rows and columns are
excited and, thus, this is not a serious problem.

In  conclusion,  we  have  shown  that  a  LUT  constructed  with  discrete 
components  can  operate  to  at  least  250  MHz  RZ  or  500  MHz  NRZ  data 
rates.  We  expect  substantially  higher  rates  from  a  hybrid  or 
monolithic  package.  However,  we  should  emphasize  that  a  complete 
analysis  of  the  driving  circuit  is  necessary  to  enable  properly 
terminated  lines  to  be  designed  that  will  eliminate  the 
ringing/undershoot problems.

Figure 8.11 Responses of the 7 laser diodes of column 4, when the
(4,3) laser diode is excited.

8.5 Binary-to-Residue Conversion (B/R)

A simple way to understand B/R conversion is through a numerical
example. Consider the binary representation of the number 255:

$$255 = 1\,(2^7) + 1\,(2^6) + \cdots + 1\,(2^0) = 11111111 \ \text{(binary)}$$

Because of the preservation of algebraic rules in RNS (i.e.,
<r+s> = <r> + <s> and <rs> = <r><s>, where <r> is the residue
representation of number r), the residue representation of 255 is equal
to the residue representation of the sum of the products of the binary
bits and the powers of 2 they correspond to. For example, the residue
of 255 modulo 7 is

$$\begin{aligned}
\langle 255 \rangle_7 &= \langle 1\cdot 2^7 \rangle_7 + \langle 1\cdot 2^6 \rangle_7 + \cdots + \langle 1\cdot 2^0 \rangle_7 \\
&= \langle 2 + 1 + 4 + 2 + 1 + 4 + 2 + 1 \rangle_7 \\
&= \langle 17 \rangle_7 = 3 \qquad (8.2)
\end{aligned}$$

From  Equation  (8.2)  we  see  that  for  an  efficient  B/R  conversion,  we 
need  to:  (1)  multiply  each  input  bit  by  the  residue  representation  of 
its  corresponding  power  of  2  and  (2)  sum  all  the  residues. 
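
The two steps just listed map directly onto a few lines of Python. This is a minimal sketch with hypothetical names (`bit_weights`, `binary_to_residue`); the precomputed weights play the role that fixed LUT entries would play in hardware.

```python
def bit_weights(n_bits, modulus):
    """Residues of 2**k for k = 0 .. n_bits-1 (fixed, known in advance)."""
    return [pow(2, k, modulus) for k in range(n_bits)]

def binary_to_residue(x, modulus, n_bits=8):
    """Sum the weights of the bits that are set, then reduce once more."""
    weights = bit_weights(n_bits, modulus)
    total = sum(w for k, w in enumerate(weights) if (x >> k) & 1)
    return total % modulus

print(bit_weights(8, 7))            # [1, 2, 4, 1, 2, 4, 1, 2]
print(sum(bit_weights(8, 7)))       # 17
print(binary_to_residue(255, 7))    # 3
```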

Fortunately,  all  these  operations  can  be  achieved  in  a  pipelined 
architecture  using  the  LUT  technology  of  the  previous  section.  An 
example  of  such  an  implementation  is  shown  in  Figure  8.12  for  an  8-bit 
A/D  converter  and  modulo  7.  Each  output  bit  of  the  A/D  converter  is 
connected  to  a  binary  switch.  Pairs  of  binary  switches  are  connected 
to  2x2  LUTs.  Pairs  of  2x2  LUTs  are  connected  to  4x4  LUTs  which  in  turn 
are connected to 7x7 LUTs. Each 2x2 LUT provides the residue of the
sum of two bits. Depending on whether each bit is "1" or "0", the
binary switch activates one of the two possible electrodes (each bit
can be 0 or Y in residue, Y being dependent on the power of 2 each bit
corresponds to and the modulo we use). Once two inputs are present in
the 2x2 LUT, one output (out of four possible outputs) is produced.
For example, for the case of bits 2^7 and 2^6, if we have 1·2^7 and
1·2^6 the LUT output is 3. Two such outputs now drive 4x4 LUTs whose
outputs drive the 7x7 LUT. Note that this is a pipelined process and
thus at each clock cycle a new residue is produced.
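
The three-stage structure (2x2, 4x4, then 7x7 LUTs) amounts to a pairwise reduction tree of modular additions. The Python sketch below imitates that data flow for a single input word; it is an assumption-laden illustration (the function name is hypothetical, the bit count must be a power of two, and no pipelining is modelled).

```python
def tree_binary_to_residue(x, modulus=7, n_bits=8):
    """Reduce the bit contributions pairwise, one LUT stage per pass.

    Returns (residue, number_of_stages). n_bits is assumed to be a power
    of two so that every stage pairs its inputs exactly.
    """
    # Output of the binary switches: 0 or the residue of the bit's power of 2.
    values = [((x >> k) & 1) * pow(2, k, modulus) for k in range(n_bits)]
    stages = 0
    while len(values) > 1:
        # Each pair of values addresses one modular-addition LUT.
        values = [(values[i] + values[i + 1]) % modulus
                  for i in range(0, len(values), 2)]
        stages += 1
    return values[0], stages

print(tree_binary_to_residue(255))           # (3, 3)
print(tree_binary_to_residue(0b11000000))    # (3, 3): bits 2^7 and 2^6 give 2 + 1
```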

For  most  practical  residue  processors,  moduli  that  exceed  11  are 
required.  In  these  cases  2x2  and  4x4  LUTs  are  always  used.  The 
dimensions  kxk  of  the  third  LUT  depend  on  the  modulo  k  we  use.  For 
example,  for  modulo  13  we  need  a  13x13  LUT.  The  number  of  2x2,  4x4  and 
kxk  LUTs  we  need  depends  on  the  number  of  input  binary  bits.  Table  8.1 
shows  the  number  of  2x2,  4x4  and  7x7  LUTs  we  need  for  8,  16  and  32 
input  bits  when  modulo  7  is  used. 

Table 8.1  Number of LUTs Required for Modulo 7 B/R Conversion

N (bits)     2x2     4x4     7x7
    8          4       2       1
   16          8       4       3
   32         16       8       7

From  the  above  we  can  conclude  that  B/R  conversion  can  be  achieved  in  a 
very  simple  way  by  means  of  LUT  technology.  The  process  can  be  of  high 
speed  because  a  pipelined  architecture  can  be  used. 
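
The entries of Table 8.1 follow from the tree structure: each stage halves the number of signals, and the remaining partial residues are combined pairwise in full-size LUTs. A small Python sketch (hypothetical function name, input width assumed to be a power of two no smaller than 4) reproduces the table.

```python
def lut_counts(n_bits):
    """Return (number of 2x2, 4x4 and kxk LUTs) for an n_bits B/R converter."""
    n_2x2 = n_bits // 2     # one 2x2 LUT per pair of bits
    n_4x4 = n_2x2 // 2      # one 4x4 LUT per pair of 2x2 outputs
    n_kxk = n_4x4 - 1       # pairwise merging of n_4x4 partial residues
    return n_2x2, n_4x4, n_kxk

for bits in (8, 16, 32):
    print(bits, lut_counts(bits))
# 8 (4, 2, 1)
# 16 (8, 4, 3)
# 32 (16, 8, 7)
```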


8.6  Residue-to-Binary  Conversion  (R/B) 


Residue-to-binary conversion can be achieved by converting into the
mixed radix system [1], which allows for the use of LUTs.


The principles of operation are pictured in Figure 8.13 for a system of
four moduli, m_1 through m_4, with residues r_1 through r_4. The four
residues are clocked in simultaneously with r_2, r_3 and r_4 going to LUT
accumulators and r_1 being fed through an additive inverter (for
complement calculation) to each LUT accumulator; r_1 also passes through
the system as output a_1.

The subtractions are performed with respect to the respective moduli and
the results are passed to the LUTs. There they are modulo multiplied by
the multiplicative inverse of m_1 and passed on to the next stage. The
second stage performs similarly to the first except that now a_2 is
subtracted from the others and is passed to the output. The process
continues through a cascade until the final output is triggered, in this
case by the arrival of a_4. The result of the decoding process must then
be calculated according to the expression

$$I = a_1 + a_2 m_1 + a_3 m_1 m_2 + a_4 m_1 m_2 m_3 + \cdots \qquad (8.3)$$

where I denotes that the result is an integer and the sequence
continues for as many terms as there are residues.


An example of this procedure is shown in Figure 8.14. In this example
we use 4 moduli 2, 5, 7 and 9. The input residue representation is
(1,3,4,0) which corresponds to 333. Rectangular blocks show the LUTs
and small squares the output detectors. The values within the detector
blocks correspond to the results of the wiring maps for the additive
inverters, (-r_i) modulo m_j, when applied to the original input values
(shown next to the arrows). One can understand the operation by tracing
the heavily outlined squares in each block. In the bottom of Figure 8.14
we show the conversion result which is 333.
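
The cascade of subtractions and inverse multiplications can be traced numerically. The following Python sketch is a plain software rendering of standard mixed radix conversion, offered as an illustration rather than as the circuit of Figure 8.14; the function names are hypothetical and the inverse is computed with the built-in `pow` instead of a wiring map.

```python
def mixed_radix_digits(residues, moduli):
    """Return the mixed radix coefficients a_1 .. a_N of an RNS number."""
    r = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a_i = r[i]                      # this residue passes to the output
        digits.append(a_i)
        for j in range(i + 1, len(moduli)):
            m_j = moduli[j]
            inv = pow(m_i, -1, m_j)     # multiplicative inverse of m_i mod m_j
            r[j] = ((r[j] - a_i) * inv) % m_j   # subtract, then multiply
    return digits

def from_mixed_radix(digits, moduli):
    """Evaluate I = a_1 + a_2 m_1 + a_3 m_1 m_2 + ..."""
    total, weight = 0, 1
    for a, m in zip(digits, moduli):
        total += a * weight
        weight *= m
    return total

MODULI = (2, 5, 7, 9)
digits = mixed_radix_digits((1, 3, 4, 0), MODULI)
print(digits)                             # [1, 1, 5, 4]
print(from_mixed_radix(digits, MODULI))   # 333 = 1 + 1*2 + 5*10 + 4*70
```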

Tai et al. [21] have suggested a pipelined version of the R/B converter.
In their approach the coefficients a_1, a_2, a_3, etc., are delayed so
that they all appear at the same time. Figure 8.15 shows the pipelined
version of the R/B converter of Figure 8.14. Note that delays are
implemented using technology similar to that used in LUTs; i.e., sets
of laser diodes, fibers and detectors. Through this pipelining
structure, we can have multiple sets of residues following each other
through the R/B converter with a time-gap of only one cycle and thus
high speed conversions are possible. Note that the total number of
LUTs is 2N-2 where N is the number of moduli we use.


Let us consider now a practical implementation of Equation 8.3,
assuming that a pipelined R/B decoder is available. The first
operation we need to do is to multiply each coefficient a_i by the
proper m_1 m_2 ... m_{i-1} product (these products are known a priori).
This operation can be accomplished via high speed digital ROM LUTs (see
Figure 8.16), which are activated depending on which value of the a_i
coefficients is "on." Next, we fan-in all outputs of each a_i ROM LUT
and add them in pairs via a high-speed digital adder. Subsequently, we
sum the results from the adders to obtain the final output I. In
Figure 8.16 we present an example of this operation for the R/B
converter of Figure 8.15. It is important to note that the approach of
Figure 8.16 is fully pipelined which means high speed conversion
capability. Note that the total number of ROM LUTs we need is equal to
the sum of the N moduli we use, whereas the total number of high-speed
adders is equal to N-1.
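
The ROM stage can be pictured as one small table per coefficient, holding every possible value of that coefficient already multiplied by its fixed weight, followed by a tree of adders. The Python sketch below illustrates this for the moduli 2, 5, 7 and 9; the names and the pairwise adder schedule are assumptions made for the example, not a description of Figure 8.16.

```python
def build_rom_luts(moduli):
    """roms[i][a] holds a * (m_1 * ... * m_(i-1)) for every possible digit a."""
    roms, weight = [], 1
    for m in moduli:
        roms.append([a * weight for a in range(m)])
        weight *= m
    return roms

def rom_sum(digits, roms):
    """Look up each weighted term, then add the terms in pairs.

    Returns (result, number_of_two_input_additions).
    """
    terms = [rom[a] for rom, a in zip(roms, digits)]
    additions = 0
    while len(terms) > 1:
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:              # an odd term waits for the next level
            paired.append(terms[-1])
        additions += len(terms) // 2
        terms = paired
    return terms[0], additions

ROMS = build_rom_luts((2, 5, 7, 9))
print(sum(len(rom) for rom in ROMS))    # 23 stored words = 2 + 5 + 7 + 9
print(rom_sum([1, 1, 5, 4], ROMS))      # (333, 3): three additions, i.e. N - 1
```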

From  the  above  we  see  that  R/B  conversion  can  be  achieved  in  a  very 
simple  pipelined  way  with  efficient  utilization  of  LUT  technology. 

8.7  Hardware  Minimization 

One of the issues associated with position-coded LUTs is their
complexity in terms of numbers of gates (optical sources in our case).
This is because the number of gates N_s grows as the square of the
modulo m. To understand this, consider a particular residue example,
specifically, a residue equivalent of an 8-bit multiplication, which
requires moduli 3, 5, 7, 8, 11 and 13. In this case, we need a total of
437 gates or laser diodes. For computationally linear signal processing
problems this is more-or-less acceptable but for computationally
non-linear applications it becomes a serious limitation. For example,
consider the dynamic range needed in order to solve a system of linear
equations of dimension 12. Assuming that the maximum value of the
determinant is 128, we find (see Section 9.2, Equation 9.5) that the
dynamic range needed is of the order of 5 x 10^31 (about 105 binary
bits)! To accomplish this, we need the moduli 9, 11, 13, 17, 19, 23, 25,
29, 31, 37, 41, 43, 47, 49, 53, 59, 61, 64, 67, 71, 73. In this case just
one residue multiplier requires a total of 42,452 laser diodes.
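
Both diode counts can be verified directly, since an m x m LUT needs m squared sources. The Python sketch below does so; the function name is hypothetical. As an assumption, the two moduli sets are taken to be 3, 5, 7, 8, 11, 13 (the set used in Section 8.3) and the 21-modulus set containing 61 and 64, which are the readings that reproduce the totals of 437 and 42,452 quoted in the text.

```python
from math import prod

def lut_gate_count(moduli):
    """Total number of sources for one position-coded LUT per modulus."""
    return sum(m * m for m in moduli)

SMALL = (3, 5, 7, 8, 11, 13)
LARGE = (9, 11, 13, 17, 19, 23, 25, 29, 31, 37, 41,
         43, 47, 49, 53, 59, 61, 64, 67, 71, 73)

print(lut_gate_count(SMALL), prod(SMALL))      # 437 120120
print(prod(SMALL) >= 255 * 255)                # True: covers an 8-bit product
print(lut_gate_count(LARGE))                   # 42452
print(prod(LARGE) > 5 * 10**31)                # True: covers the required range
```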


Obviously, for this kind of problem the complexity grows rapidly. In
an effort to minimize this problem, we have sought solutions at both
the LUT and processor levels. In the first part of this section, we
discuss our attempt to minimize the number of laser diodes per LUT and
in the second part we discuss a general residue scaling approach
suggested by J. N. Polky et al.

The number of LUT gates can be reduced once it is realized that there exist symmetry planes in both multiplier and adder LUTs. Consider, for example, the LUTs of Figure 8.3. In the multiplier LUT one can see that the symmetry line passes from coordinates (0,0) to (6,6). Similarly, the adder LUT is symmetric about the line from (0,6) to (6,0). Thus, if we route some of the inputs properly, only about one-half of the diodes will be needed. One such example of a reduced LUT is shown in Figure 8.17 for the case of a modulo 7 multiplier LUT. It can be seen that we need 28 diodes versus 49 for the general implementation. This is achieved via proper interconnection of similar output channels. In the process, however, we are forced to use multiple drivers per grid, whereas for the non-reduced case we need only single drivers. In addition, we have increased the average number of interconnections (per diode) by a factor of 2. Thus, we see that the reduction of gates is accompanied by an increased complexity in the interconnect pattern. In fact, one can easily show that as the number of diodes decreases, the number of interconnections per diode increases. This relationship is shown in Figure 8.18, where we plot the number of laser diode rows ($N_g$) versus the number of interconnections per diode ($n$) for modulo $m_c$. From this plot we can see that, in principle, a LUT made out of 1 row of gates (total of $m_c$ diodes) is possible; however, each diode needs at least $2m_c$ interconnections.
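The diode counts quoted for the modulo 7 multiplier can be checked with a few lines of Python. This is only an illustrative sketch, not part of the original design work: the function names are hypothetical, and it assumes one diode per table entry, as in the unreduced LUT. It builds the full table, confirms the symmetry about the (0,0)-(6,6) diagonal, and counts the entries on or above that diagonal.

```python
def multiplier_lut(m):
    """Full m x m look-up table for (a * b) mod m."""
    return [[(a * b) % m for b in range(m)] for a in range(m)]


def gate_counts(m):
    """Return (full, reduced) gate counts for a symmetric two-input LUT.

    full    : one gate per (a, b) pair
    reduced : one gate per unordered pair, i.e. the diagonal plus one triangle
    """
    return m * m, m * (m + 1) // 2


lut = multiplier_lut(7)
is_symmetric = all(lut[a][b] == lut[b][a] for a in range(7) for b in range(7))
full, reduced = gate_counts(7)
print(is_symmetric, full, reduced)  # True 49 28
```

The reduced count grows as m(m+1)/2, which is still quadratic in m; this is consistent with the remark that a factor of 2 is about the limit of the symmetry approach.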

Thus, we need to optimize the relative numbers of diodes and interconnections; one method of accomplishing this is to assign cost functions to both diodes and interconnections. For our experimental LUT boards a simple analysis shows that the multiple drivers and the more complicated interconnection pattern result in a more expensive and more complicated LUT and, thus, the unreduced LUTs are preferable. Note, however, that this analysis is performed for a discrete component LUT where the physical length of the electrode microstrip lines is of the order of 10 cm. In a realistic scenario where LUTs are made out of integrated LED or LD arrays and the microstrip lengths are considerably smaller (about 1 cm), a similar analysis may yield different results. In any event, our present feeling is that a reduction by a factor of 2 is about the limit of this approach. The result of such a reduction is obviously not very significant and, thus, we need to develop LUT architectures in which $N_s$ grows at a slower rate than $m^2$.

Figure 8.18 Number of Laser Diode Rows ($N_g$) versus Number of Interconnections per Diode ($n$) for Modulo $m_c$

Hardware reduction can also be accomplished at the processor level by using techniques which include efficient algorithm selection (to keep the dynamic range low) and scaling. The choice of algorithm is application dependent. For the APAR case the only present alternatives are Gram-Schmidt type algorithms and possibly Gauss elimination techniques, since these do not require divisions or square roots (which are problematic for RNS). Scaling is possibly a more profitable approach since it can be applied to a number of applications. The scaling technique we have studied is that suggested by J. N. Polky et al. (Ref. 22), which offers the advantages of (1) scaling for both positive and negative numbers and (2) pipelining. In this scaling technique (Ref. 22) we first prescale by adding M/2, next scale by performing about N+1 additions per module, and then post-scale by subtracting various biases. The technique requires that an extra module be used. For the prescaling, one wiring map per module is required and its output is fanned out to N+1 channels. For the scaling operation, each module requires N maps whose results need to be added and, thus, N-1 adder LUTs (per module) are needed together with one more adder LUT for a combined scaling/post-scaling calculation. Thus, the total number of LUTs needed is $N^2 + N$. It is of interest to examine the accuracy and range of this scaling procedure so that the "savings" (from the scaling) and "expenses" (due to the $N^2 + N$ LUTs) can be compared. To achieve this we use a computer program which calculates (in residue) the scaling results for a variety of scale factors S and residue dynamic ranges k. In Tables 8.2, 8.3 and 8.4, we show the worst-case scaling accuracy (SA) and the expected accuracy (EA, obtained by performing a straightforward division of the number to be scaled by S) for different residue ranges M (k = 27, 24 and 20 binary bits, respectively) and scaling factors S. For each case, 512-65,000 numbers are scaled (S = 0.59 - 0.00019 of k) and the lowest SA is reported.

From  the  above  tables  we  see  that  SA  varies  as  a  function  of  both  k  and 
S,  with  the  latter  being  the  most  critical.  This  implies  that  for  a 
given  scaling  accuracy,  the  scaling  factor  cannot  exceed  a  given 
percentage  of  k.  For  example,  for  k  =  24  (Table  8.3)  if  a  SA  of  9  bits 
is  required,  then  S  cannot  exceed  0.059  of  k. 

From Tables 8.2-8.4 we see that if a SA of 9-10 bits is desired, the
maximum  scaling  factor  S  cannot  exceed  0.0059  k.  This  implies  that,  in 
principle,  with  k  values  of  about  20,000  we  can  handle  much  larger 
dynamic  range  problems.  This  issue  needs  to  be  studied  further  in 
conjunction  with  a  specific  algorithm.  Only  in  such  a  context  can  we 
evaluate  whether  the  benefits  of  scaling  are  significant. 

Unfortunately,  the  Gram-Schmidt  algorithm  of  Section  9  cannot 
incorporate  this  scaling  technique. 

8.8  System Characteristics for a Square Systolic Residue System


In this section we discuss an example of a residue LUT processor, namely, a systolic processor for matrix-matrix multiplication. This particular example is chosen because a large number of APAR algorithms can be expressed in terms of matrix-matrix multiplication. Figure 8.19 shows a typical configuration for a square systolic array. The array consists of n x n kAU LUTs, each of which consists of cascaded multiplier and adder LUTs. To ensure proper data propagation, local interconnects between adjacent kAU LUTs involve a fixed delay which is equal to the delay of a LUT. For the purposes of this example, we assume a 16x16 real matrix multiplication scenario in which the elements of the input matrices cannot exceed 8 bits. Each kAU is required to add 16 products each consisting of 16 bits. Thus, the dynamic range requirement is 20 bits or 1,048,576. To handle this we use the moduli 7, 8, 9, 11, 13 and 17, which yield an M of 1,225,224. Note that in this example the dynamic range of the multiplier can be handled by moduli 7, 8, 9, 11 and 13; however, we are forced to use the extra modulo 17 because of the dynamic range requirement of the adder. For the systolic processor we need 2 x 16 x 16 = 512 LUTs per modulo layer, giving a total number of 512 x 6 = 3,072 LUTs. The total number of LDs in these LUTs is about 396,000. Additional LUTs are required for B/R and R/B conversion. We need a total of 6 x 16 x 2 = 192 B/R converters, each consisting of 7 LUTs. The average number of LDs per B/R is about 170 and, thus, the total number of LDs in the B/R converters is about 32,000 (i.e., about 8% of the number in the processor). The number of R/B converters is determined by the read-out arrangement. Let us assume the use of 16 R/Bs in order to read out the results in a pipelined fashion. Each R/B needs 15 LUTs which contain about 2,700 LDs and thus the total number of LDs for the R/Bs is about 43,000 (i.e., about 11% of the number in the processor). Thus, we see that the total number of LDs is about 470,000.

By comparison, a fully digital 16 x 16 array processor requires a total of 256 kAUs. Each kAU requires about 76 Full Adders, each of which requires about 10 gates. Thus, the total number of gates is 190,000, which is about 40% of the complexity (in terms of gates) of the residue system. Thus, we conclude that, for this particular square systolic processor, the residue LUT implementation has roughly twice the complexity of an electronic digital implementation. However, this situation can change if the complexity of the LUT is reduced. We also note that the specific application which we have analyzed does not favor the residue system. This is because the multiplier LUTs have an increased complexity dictated by the dynamic range of adder LUTs, which requires the additional modulo 17 LUTs. Without these LUTs the number of gates in the processor is reduced to about 270,000, comparable to that of a fully electronic processor. This emphasizes the necessity of choosing applications which are well suited to residue implementation.
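The arithmetic behind these estimates is easy to reproduce. The following Python sketch is an illustration only; the variable names are ours, and it assumes an unreduced LUT of m x m laser diodes for each modulus m, with one multiplier LUT and one adder LUT per cell. Converter LUTs are left out because the text gives only their average sizes.

```python
from math import prod

moduli = [7, 8, 9, 11, 13, 17]   # moduli quoted in the text
n = 16                           # the matrices are n x n
input_bits = 8                   # bits per input matrix element

# Each product needs 2 * 8 = 16 bits; summing n = 16 products adds 4 more bits.
product_bits = 2 * input_bits
sum_bits = product_bits + (n - 1).bit_length()
required_range = 2 ** sum_bits

range_all = prod(moduli)         # dynamic range with modulus 17 included
range_mult = prod(moduli[:-1])   # dynamic range of 7, 8, 9, 11, 13 only

# LUT and laser diode (LD) counts for the systolic array itself.
luts_per_layer = 2 * n * n
total_luts = luts_per_layer * len(moduli)
total_lds = luts_per_layer * sum(m * m for m in moduli)

# Fully digital comparison: n * n cells, 76 full adders each, 10 gates per adder.
digital_gates = n * n * 76 * 10

print(required_range, range_all, range_mult)   # 1048576 1225224 72072
print(total_luts, total_lds, digital_gates)    # 3072 395776 194560
```

The five smaller moduli give 72,072, which covers the 16-bit products (65,536) but not the 20-bit accumulated sums; this is why the sixth modulus is needed.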

The  MS  of  the  system  is  equal  to  that  of  the  LUTs  which  we  expect  to  be 
in  the  range  1-6  GHz.  To  calculate  the  SE  of  the  processor  we  perform 
an  analysis  similar  to  the  one  in  Section  8.2.  Thus,  assuming  1  GHz 
operation  and  LUTs  implemented  with  laser  diodes,  a  SE  of  the  order  of 
2-3  x  10  M-A/sec.  W  can  be  expected.  These  values  of  MS  and  SE  for 
the  residue  processor  are  superior  by  at  least  an  order  of  magnitude  to 
those  of  a  GaAs  implemented  electronic  processor. 

The performance of the residue processor may be improved considerably if a different electro-optic technology, for example SEED (Self-Electro-Optic-Effect) devices, is developed for LUT implementation. These devices exhibit strong changes in optical absorption (transmission) dependent on the intensity of the incident light. These changes are due to changes in the internal electric field distribution that occur in response to a variation in carrier concentration induced by optical absorption. The response time of this process is estimated to be as short as 2 x 10^-13 sec. If this technology can be implemented in LUTs, then improvements in MS and SE by 2 orders of magnitude can be expected. In this case, the performance of the residue processor would be far superior to that attainable from electronic processors, including GaAs.


9.  RESIDUE LUT IMPLEMENTATION OF GRAM-SCHMIDT APPROACH FOR THE SOLUTION OF LINEAR SYSTEMS


9.1  Introduction


In the previous chapter, we showed how residue LUT techniques can be used to perform simple arithmetic operations. In this chapter we discuss the residue LUT implementation of a variant of the Gram-Schmidt approach for the solution of systems of linear equations. As such, this technique is directly applicable to the APAR problem. In Section 9.2, we discuss the problem of solving systems of linear equations using residue arithmetic. In Section 9.3 we present the basics of the Gram-Schmidt variant along with a numerical example. Section 9.4 contains the LUT design for the residue implementation of the technique. Finally, in Section 9.5, we discuss the characteristics of the LUT processor.


9.2  Residue Resolutions of Linear Systems


A significant advantage of RNS is its considerably greater flexibility than DMAC or BPAM. Addition, subtraction, multiplication and some forms of division can be performed in easily-implemented LUTs, and this allows the efficient implementation of rather complicated signal processing algorithms, such as the Gram-Schmidt orthogonalisation approach, in the solution of large linear systems of equations.

These considerations strongly recommend the exploitation of RNS in the APAR application, which in certain formulations requires solutions of systems of the form

$$C_v w_v = s_v \qquad (9.1)$$

where $C_v$ is the covariance matrix, $w_v$ is the adaptive weight vector, and $s_v$ is the steering vector. Adoption of RNS to solve Eq. (9.1) precludes consideration of the QR and certain other often-used algorithms which are inherently tied to real, as opposed to integer, calculations. This is not the case, however, with the modified version of the Gram-Schmidt orthogonalization, as we shall illustrate in detail in later sections. Before we address these particular algorithms, it is appropriate to look more generally at the use of residue arithmetic in solving sets of linear equations.

To begin with, we assume that the elements of $C_v$ and $s_v$ are all integers; solutions in RNS can also be computed for the more general case where data are given as Gaussian integers, but we avoid this formulation for simplicity. Because of the manner in which the covariance matrix is derived from noisy signals, it is relatively safe to assume that $C_v$ is non-singular. With $C_v^{adj}$ representing the adjoint of $C_v$, we know that

$$y = C_v^{adj} s_v \qquad (9.2)$$

is the integer solution of

$$C_v y = (\det C_v)\, s_v \qquad (9.3)$$

Thus, we can write

$$w_v = y / \det C_v \qquad (9.4)$$

which shows that division can be postponed until the last step of computation. Newman has proved that Eq. (9.3) can be solved using residue arithmetic and we highlight some important steps of his proof here. Let $Z$ be the ring of all integers and, for a given integer $k > 0$, let $Zk$ represent all integer multiples of $k$. Basically, use is made of the ring $Z_M$ of integers modulo $M$; it is the ring of residue classes of the form $r + ZM =: \langle r \rangle_M$. Then $Z_M = \{\langle 0 \rangle_M, \langle 1 \rangle_M, \ldots, \langle M-1 \rangle_M\}$, and the well-defined rules $\langle r \rangle_M \circ \langle s \rangle_M = \langle r \circ s \rangle_M$, with $\circ$ = + or x, make $Z_M$ a commutative ring. These rules justify computations involving finite sums and products; for example, if $B$ is a square matrix, then $\langle \det B \rangle_M = \det \langle B \rangle_M$. In solving linear systems of algebraic equations, Newman has shown that the principal modulus $M$ must bound certain parameters involving the determinant of $C_v$ and the second member $s_v$. Using Hadamard's inequality and Cramer's rule we have derived the following slightly improved inequality for a bound on $M$:

$$M > 2\, n^{n/2} K^{n-1} \max\{K, b\} \qquad (9.5)$$

where $K = \max_{i,j} |c_{v_{ij}}|$ and $b = \max_j |s_{v_j}|$. In the following section examples are presented which show that this bound is conservatively large, although improving it is difficult.
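The identity $\langle \det B \rangle_M = \det \langle B \rangle_M$ is what allows a determinant to be evaluated entirely on residues. A minimal Python check is sketched below; the matrix `B` and the modulus are arbitrary illustrative values and do not come from the report, and `det3` is a hypothetical helper written for the 3x3 case only.

```python
def det3(b, mod=None):
    """Determinant of a 3x3 integer matrix by cofactor expansion.

    If `mod` is given, the result is reduced into the range 0 .. mod-1.
    """
    d = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
         - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
         + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    return d if mod is None else d % mod


B = [[12, -7, 3], [4, 25, -9], [-6, 8, 31]]   # hypothetical integer matrix
M = 77                                         # hypothetical modulus

B_res = [[x % M for x in row] for row in B]    # reduce every entry first
lhs = det3(B) % M        # residue of the integer determinant
rhs = det3(B_res, M)     # determinant computed from the residues
print(lhs == rhs)        # True
```

The same argument applies to every sum and product in the adjoint of Eq. (9.2), so $y$ can be computed modulus by modulus.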


9.3  Gram-Schmidt  Variant 

The Gram-Schmidt approach is applicable to the APAR problem because it allows operations involving rectangular data matrices $A$, where $C_v = A^*A$. Here, we write

$$A^* A\, w_v = s_v \qquad (9.6)$$

which we solve for $w_v$. This is done by post-multiplying $A$ by a sequence of n x n matrices to produce a new sequence of m x n matrices where each succeeding matrix in the latter sequence has its number of orthogonal columns increased by 1. With $E_1$ representing the n x n identity matrix,

$$Q_1 = A E_1 \qquad (9.7)$$

If $E_2$ orthogonalises the second column of $AE_1$ to its first, then

$$Q_2 = A E_1 E_2 \qquad (9.8)$$

has two orthogonal columns. This sequence continues until we have $E = E_1 \cdots E_n$ and $Q = Q_n$, where $E$ is upper triangular because all of its constituents are, and $Q$ has orthogonal columns; consequently, $Q^*Q$ ($Q^*$ is the Hermitian conjugate of $Q$) is a diagonal matrix we call $\Lambda$. This procedure differs slightly from the usual Gram-Schmidt method in that we use orthogonalisation without normalising the vectors, in order to obviate the excessive growth of integers in the computation as well as to postpone division until the last step and remove the necessity to extract roots. The matrix $E_{k+1}$ differs, at most, from the identity by possessing integers in the upper half of its (k+1)st column. In summary, we now have

$$AE = Q \qquad (9.9)$$

and multiplying this equation on the left by its Hermitian conjugate gives

$$E^* A^* A E = Q^* Q = \Lambda \qquad (9.10)$$

so that

$$A^* A = E^{*-1} \Lambda E^{-1} \qquad (9.11)$$

and

$$w_v = E \Lambda^{-1} E^* s_v \qquad (9.12)$$

To perform this computation in residue arithmetic we must also evaluate the determinant of $C_v$ in each modulus. From Eq. (9.11) we see that this number is given by

$$\det C_v = \det \Lambda \,/\, |\det E|^2 \qquad (9.13)$$

where only the products of diagonal elements in $E$ and, of course, in $\Lambda$ need be computed. These computations are illustrated with an example. Before proceeding with this example, we should note that the growth of integers in this algorithm can be enormous, but that it is not the individual collection of integers appearing in the intermediate steps of the computation that must be bounded by the principal modulus; instead, we postulate that it is the bound on the coefficients in the expansions of various terms that matters.

Consider now a 4x3 system corresponding to Eq. (9.1) and given by

(9.14)

with $w_v = [x, y, z]^T$. As we described above, the columns of $A$ are orthogonalised through application of the matrix

$$E = \begin{bmatrix} 1 & -1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 3 \end{bmatrix} \qquad (9.15)$$

to produce

$$Q = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & 3 \\ 1 & -1 & 2 \\ 1 & 1 & -1 \end{bmatrix}$$

and $\Lambda = Q^*Q = \mathrm{diag}(3, 3, 15)$. Using Eq. (9.13) we find that

$$\det C_v = 15$$

and to ensure that an integer solution exists, we multiply $s_v = (1, 1, -1)^T$ by 15 and solve for $w' = (x', y', z')^T$. The inverse of $C_v$ is

$$C_v^{-1} = E \Lambda^{-1} E^*$$

so that we get $y = C_v^{adj} s_v$, which by Eq. (9.4) yields $w_v = (3/5, 0, -4/5)^T$. Note that $C_v^{-1}$ is symmetric, which is expected because $C_v$, the covariance matrix, is symmetric.
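The example can be retraced in exact arithmetic with the short Python sketch below. Two assumptions should be noted. First, the data matrix of Eq. (9.14) is not legible in this copy, so the matrix `A` used here is reconstructed as $QE^{-1}$ from the $E$ and $Q$ given above; it is not copied from the report. Second, the routine `integer_gram_schmidt` is our own illustration of "orthogonalise without normalising": it clears denominators after each step, which is one plausible reading of the procedure and not necessarily the authors' exact scaling rule.

```python
from fractions import Fraction
from functools import reduce
from math import gcd, prod


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def integer_gram_schmidt(A):
    """Return integer matrices (E, Q) with A E = Q and orthogonal columns in Q."""
    m, n = len(A), len(A[0])
    a_cols = [[Fraction(A[r][c]) for r in range(m)] for c in range(n)]
    q_cols, e_cols = [], []
    for k in range(n):
        q = a_cols[k][:]
        e = [Fraction(int(i == k)) for i in range(n)]
        for j in range(k):
            coef = dot(a_cols[k], q_cols[j]) / dot(q_cols[j], q_cols[j])
            q = [x - coef * y for x, y in zip(q, q_cols[j])]
            e = [x - coef * y for x, y in zip(e, e_cols[j])]
        # Clear denominators instead of normalising, so all entries stay integer.
        scale = reduce(lambda s, f: s * f.denominator // gcd(s, f.denominator), q + e, 1)
        q_cols.append([x * scale for x in q])
        e_cols.append([x * scale for x in e])
    E = [[int(e_cols[c][r]) for c in range(n)] for r in range(n)]
    Q = [[int(q_cols[c][r]) for c in range(n)] for r in range(m)]
    return E, Q


A = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 2, 0]]   # reconstructed as Q E^-1 (assumption)
s = [1, 1, -1]

E, Q = integer_gram_schmidt(A)
lam = [dot(col, col) for col in zip(*Q)]                    # diagonal of Q*Q
det_C = prod(lam) // prod(E[i][i] for i in range(3)) ** 2   # Eq. (9.13)

# Eq. (9.12): w = E Lambda^-1 E* s
Et_s = [dot(col, s) for col in zip(*E)]
z = [Fraction(v, l) for v, l in zip(Et_s, lam)]
w = [dot(row, z) for row in E]

print(E)       # [[1, -1, -1], [0, 1, 0], [0, 0, 3]]
print(lam)     # [3, 3, 15]
print(det_C)   # 15
print(w)       # [Fraction(3, 5), Fraction(0, 1), Fraction(-4, 5)]
```

With the reconstructed `A`, the routine returns the $E$ and $Q$ printed above, and the weights agree with $w_v = (3/5, 0, -4/5)^T$.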


Let  us  now  course  through  the  same  examples  using  residue  arithmetic. 
From Eq. (9.5) we find that M should be ≥ 2806, which can be satisfied
with  moduli  of  5,  7,  8  and  11;  in  this  particular  example,  using  2  or  3 
as  a  modulus  would  cause  <E>  to  be  singular,  a  circumstance  which  cannot 
be  tolerated.  We  have  noted  previously  that  in  some  cases  smaller  M 
values  are  sufficient  for  providing  the  correct  answer,  but  this  is  not 
always  true  and  one  thus  has  to  select  an  M  at  least  equal  to  the  upper 
bound.  To  demonstrate  that  smaller  principal  moduli  can  suffice  we 
choose  to  solve  our  4x3  example  using  only  the  two  moduli  7  and  11. 

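In modulo 7 and in modulo 11 the computation proceeds as above, with every operation reduced by the modulus. Combining the two residues of each component gives

$$w' = \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} (\langle 2 \rangle_7, \langle 9 \rangle_{11}) \\ (\langle 0 \rangle_7, \langle 0 \rangle_{11}) \\ (\langle 2 \rangle_7, \langle 10 \rangle_{11}) \end{bmatrix} = \begin{bmatrix} \langle 9 \rangle_{77} \\ \langle 0 \rangle_{77} \\ \langle 65 \rangle_{77} \end{bmatrix} = \begin{bmatrix} 9 \\ 0 \\ -12 \end{bmatrix} \qquad (9.28)$$

where the last vector components involve the representation -38, -37, ..., -1, 0, 1, 2, ..., 38 as the elements of $Z_{77}$. Comparison of Eqs. (9.28) and (9.20) shows that the RNS approach indeed provides the correct result.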
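A compact way to see the residue computation end to end is sketched below in Python. It is an illustration under stated assumptions, not the report's procedure: it solves Eq. (9.3) through the adjoint of Eq. (9.2) rather than through the Gram-Schmidt factors, and the matrix `A` is again a reconstruction ($QE^{-1}$), since Eq. (9.14) is not legible. The helper names are hypothetical. Every operation is carried out modulo 7 and modulo 11, and the Chinese remainder theorem maps the residue pairs back to the symmetric range -38 ... 38.

```python
from fractions import Fraction

A = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 2, 0]]   # reconstructed (assumption)
s = [1, 1, -1]
MODULI = (7, 11)


def normal_matrix(a):
    """C = A^T A for a real integer matrix A."""
    cols = list(zip(*a))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]


def adjugate_and_det_mod(c_full, m):
    """Adjoint and determinant of a 3x3 matrix, all operations reduced mod m."""
    c = [[x % m for x in row] for row in c_full]

    def cof(i, j):
        r = [k for k in range(3) if k != i]
        q = [k for k in range(3) if k != j]
        minor = (c[r[0]][q[0]] * c[r[1]][q[1]] - c[r[0]][q[1]] * c[r[1]][q[0]]) % m
        return minor if (i + j) % 2 == 0 else (-minor) % m

    adj = [[cof(j, i) for j in range(3)] for i in range(3)]   # transposed cofactors
    det = sum(c[0][j] * cof(0, j) for j in range(3)) % m
    return adj, det


def crt_symmetric(residues, moduli):
    """Chinese remainder reconstruction into the range -(M//2) .. M//2."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse, Python 3.8+
    x %= M
    return x - M if x > M // 2 else x


C = normal_matrix(A)
y_res, det_res = [], []
for m in MODULI:
    adj, det = adjugate_and_det_mod(C, m)
    y_res.append([sum(adj[i][j] * (s[j] % m) for j in range(3)) % m for i in range(3)])
    det_res.append(det)

y = [crt_symmetric([y_res[k][i] for k in range(len(MODULI))], MODULI) for i in range(3)]
det_C = crt_symmetric(det_res, MODULI)
w = [Fraction(v, det_C) for v in y]

print(y_res)   # [[2, 0, 2], [9, 0, 10]]
print(y)       # [9, 0, -12]
print(det_C)   # 15
```

Dividing `y` by `det_C` as in Eq. (9.4) gives (3/5, 0, -4/5). The division is the only non-residue step and comes last.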


Caution must be employed because certain moduli can yield singular equations or degenerate inner products as a result of the fact that, in a quotient ring $Z_m$, $\langle x\, x \rangle = 0$ can hold for a non-zero vector x. For example, in $Z_5$ the vector (1, 2) has $\langle x\, x \rangle = 5 \equiv 0$. This computation also has the disadvantage, compared with the straightforward Gram-Schmidt orthonormalization and Q-R algorithms, that it generates integer growth in the computation at an alarming rate.
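Degenerate inner products are easy to enumerate for small moduli. The Python sketch below is illustrative only; the function name and the choice of moduli and vector length are ours. It lists the non-zero vectors whose inner product with themselves vanishes modulo m, and it also checks the singular-modulus remark made for the worked example, where det E = 3.

```python
from itertools import product


def isotropic_vectors(m, dim):
    """Non-zero vectors x in (Z_m)^dim with <x, x> = 0 (mod m)."""
    return [x for x in product(range(m), repeat=dim)
            if any(x) and sum(c * c for c in x) % m == 0]


mod5 = isotropic_vectors(5, 2)
mod7 = isotropic_vectors(7, 2)
print(len(mod5), (1, 2) in mod5)   # 8 True
print(len(mod7))                   # 0

# det E = 3 in the worked example, so <E> has no inverse modulo 3.
det_E = 3
print(det_E % 3 == 0)              # True
```

For two-component vectors, modulus 7 has no degenerate vectors while modulus 5 has eight, which suggests that the choice of moduli has to be checked against the data dimensions as well as the required dynamic range.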


9.4  LUT  Implementation  of  Gram-Schmidt  Approach 


The implementation of the modified Gram-Schmidt procedure described in this section is based solely on the use of residue LUTs and delays. Our objective is to invert matrix $A$ in order to solve a linear system of the form $Ax = b$, by expressing

$$AE = Q$$

where $E$ is a triangular matrix and $Q$ has orthogonal columns. Once $E$ and $Q$ are known, we proceed to calculate

$$Q^* Q = \Lambda$$

We are now ready to invert $A$ by calculating

$$C = A^{-1} = E Q^{-1} \qquad (9.32)$$

The  first  part  of  the  implementation  deals  with  the  calculation  of  the 
column  vectors  of  Q,  which  we  note  by  w^ ,  and  the  coefficients  a^/J 
...,i  =  1,2,  j  =  i+1,  k  =  j+1,  ...,  which  represent  the  columns  (in 
order)  of  the  matrices  E^.  A  simple  analysis  shows  that  each  of  the  w^ 
can  be  expressed  as 


“i  =  ^1=1  +  ^2-2  +  ^3=3  +  •  •  •  ♦'i-A-l  *  ^i-i 


(9- 


where  p,  is  the  element  of  the  i*^  column  of  the  E.  matrix  and  u.  is 
the  i  column  of  A.  To  calculate  the  w^,  we  need  to  know  the 

coefficients  p^,  i  =  1,2,3,  - ,  i-1,  as  well  as  all  w^,^,  •••»  2^ 

This  implies  that  the  vector/coefficient  calculation  is  serial  and 
alternates,  i.e.,  we  first  calculate  w^  =  Uj,  then  we  calculate  a^, 
followed  by  Wg  then  ^^^2^3’  e^c’  The  parametric  form  of  the  p^ 
coefficients  can  be  written  as 


*1 


=  1 


=  <2i2i>  /  <2iHi> 


(9.34) 


b2  =  *1  <22  V  /  <2222> 


'■i-i  =  H  <iiSi>  / 


where  <  >  designates  the  vector  dot  product  and  the  superscript  bar 
(vinculum)  means  complement.  Eqs.  (9.34)  reveal  that,  with  the 


exception  of  p^  and  p^,  the  coefficients  p^  have  a  rather  regular  form 
which  is  a  function  of  p^,  (<2j£2jc>)  ^  and  <w^U£>.  We  thus  need  a  LUT 


set  (LUTS)  that  is  driven  by  w,  and  produces  an  output  that  is 

-1 

proportional  to  (<2^2^)  (LUTS1) ,  and  a  LUTS  that  is  driven  by  p jU^, 
Wj^  and  (v^w^>)  ^  and  gives  two  outputs,  one  that  is  proportional  to 
and  the  other  proportional  to  (LUTS2)  .  Similar  LUTS  are  needed 

for  the  calculation  of  <w^w^>  and  p ^  (LUTS3  and  LUTS4) .  Figures  9.1 
and  9.2  show  typical  examples  of  LUTS1  and  LUTS2.  In  these  figures 
each  LUT  has  2  inputs  and  1  output.  The  top  part  of  the  LUT  shows  the 
operation  performed  (multiplication  (*)  or  addition  (+))  with  respect 
to  modulo  m.  The  middle  part  of  the  LUT  shows  the  implementation  by 
wire maps of functions such as dot product complement (< >) and inverse
(1/< >); when the middle part is blank, no operation is performed
there. The lower part symbolizes the output detectors. In the LUTS of
Figures 9.1 and 9.2, one can also see blocks that denote delay(s)
(denoted by D), which are necessary for data synchronization. To
simplify,  we  designate  the  equivalent  block  diagrams  of  the  various 
LUTS  structures  and  delays  with  the  blocks  of  Figure  9.3,  where  the 
number  on  the  top  right  corner  shows  the  total  delay  (in  clock  cycles) 
that  the  LUTS  needs  in  order  to  provide  the  lower  output,  a  useful 
quantity  for  calculations  of  delays  needed  for  data  synchronization. 


With  the  aid  of  Figure  9.4,  we  now  describe  a  pipelined  processor  that 
calculates  w^  and  p ^  for  a  6x6  example.  It  can  be  seen  that  the 
processor  uses  LUTS1  through  LUTS4  plus  delays  and  adder  LUTs.  The 
inputs  of  the  system  are  the  elements  of  the  matrix  A.  We  assume  that 
all  the  elements  of  A  are  fed  into  the  system  in  parallel.  This  is  not 
a  necessary  condition  and  is  adopted  mostly  for  simplicity.  The 
elements  of  A  are  fed  through  6  row  lines  (i.e.,  a  total  of  36  lines) 
which  are  located  at  the  top  left  of  Figure  9.4.  The  top  row  of  the 
system consists of one LUTS4 and 5 LUTS3. The former provides <w^w.>

Figure 9.3 Equivalent Block Diagrams for Look-Up Table Sets 1 through
4 and Delay Unit for Use in Subsequent Pipeline Processor
Schematics.


which  is  needed  for  the  calculation  of  the  dg  through  Cg  coefficients. 
Its  output  is  fed  in  parallel  to  the  5  LUTS3  which  calculate  the  above 
coefficients  and  the  products  of  the  coefficients  with  the  proper  u^’s. 
The  latter  are  needed  for  the  calculation  of  the  various  w^  [see 
Eq. (9.33)]. For our 6x6 example, it takes 7 cycles for the
calculation  of  w^  (the  7th  cycle  being  needed  for  the  addition  of  ^-2 
and  w^  =  =  a^)  and  5  cycles  for  the  calculation  of  <*2~£S 

coefficients.  The  Wg  output  is  connected  to  a  LUTS1  unit  while  the 
other  outputs  (delayed  by  one  clock  cycle)  are  fed  into  4  LUTS2  which 
after  7  cycles  calculate  w^  and  the  coefficients  l2~*2‘  that  we 

now  need  LUT  adders  in  order  to  add  the  to  @2-2*  74-4  to  72-2’ 

and  so  on.  This  process  continues  for  4  more  rows  until  all  w^  and  all 
the  coefficients  of  the  matrices  are  calculated.  It  is  important  to 
note  that  the  processor  of  Figure  9.4  operates  in  a  pipelined  fashion 
and  thus  constantly  updates  the  vectors  w^  and  the  coefficients  a-e. 
This is very important for the APAR scenarios where one wants to
constantly  update  the  adaptive  weights.  Finally,  we  note  that  the 
processor  provides  output  data  (vectors  and  coefficients)  every  7  clock 
cycles  (see  Figure  9.4). 

We  now  proceed  to  describe  another  pipelined  processor  (Figure  9.5) 
which  is  used  in  order  to  calculate  the  elements  of  the  E  matrix  which 
in  turn  are  necessary  for  calculating  Eq.  (9.32).  For  our  6x6  example 
one can easily show that the elements of the triangular E matrix are
given by

Figure 9.5 Pipelined Processor for Calculating the Products of E

The form of the elements shows that in calculating the e.^ element we
need to know all the e. . , elements as well as all the u. *
coefficients. This implies that another serial-type operation is
necessary. This is exactly what the processor of Figure 9.5 performs;
the  e^j  element  is  calculated  only  after  the  previous  e^, 
a=l,2, . . . , j-1,  elements  have  been  calculated.  Note  that  in  order  to 
avoid  unnecessary  complexity  we  have  to  rearrange  the  order  in  which  we 
receive  some  of  the  coefficients  from  the  previous  processor. 
Specifically,  we  have  to  delay  the  first  set  of  coefficients  by  an 
aaount  such  that  we  receive  all  *2~*2  ^*  saa*  c^oc^  cycle,  all 

the  same  clock  cycle,  etc.;  thus,  we  need  to  delay  a ^  by  one 
clock  cycle,  by  two  clock  cycles,  etc.  Once  this  is  done, 
coefficients with the same subscript will arrive in parallel (see top
of Figure 9.5). It is important to remember that each set of
coefficients  arrives  7  cycles  after  the  previous  set,  because  this 
explains  the  choice  of  the  delays  we  use.  The  operations  necessary  for 
calculating  Eqs.  (9.35)  can  be  achieved  via  the  use  of  multiplier  and 
adder  LUTs  as  well  as  delays.  The  ^2~€2  c^a*'a  are  parallel  into 

4  multiplier  LUTs  which  form  the  products  with  a^.  The  outputs  are 
then  fed  into  4  adder  LUTs  which  form  the  sums  with  the  data. 

Note  that  although  all  coefficients  with  subscript  1  are  equal  to  1,  we 
treat  them  as  unknowns  in  order  to  generalise  the  design  of  the 
processor.  Since  data  are  ^  cycles  behind  the  data,  we 

need  to  delay  the  latter  by  8  cycles  in  order  to  have  them  available 
for  the  addition.  The  result  under  the  p ^  line  (equal  to  e^»  see 
Eq.  (9.35)),  is  needed  for  the  calculation  of  the  products  with  the 
next  set  of  data,  which  will  arrive  a  total  of  5  cycles  later.  This 
result  (e^)  is  delayed  by  6  cycles  and  then  fed  to  3  multipliers  which 
are  also  driven  by  data  These  results  are  then  added  to  the 

results  under  lines  72~€2  ^e  ®14  element  i®  computed.  This 
pipelined  process  continues  until  all  e^  elements  are  computed.  Note 
that  in  parallel  to  the  e^  calculation,  we  also  perform  the  e2i-egi 
calculation  (the  e^  element  does  not  depend  on  the  ej_^  ^  element). 
This  is  accomplished  by  driving  sets  of  units  similar  to  the  ones  we 
used  for  the  calculation  of  the  e^  elements.  Due  to  the  triangular 
form  of  the  matrix  E,  the  number  of  units  necessary  for  the  calculation 
of  elements  ej_^  ^  is  reduced  by  one  as  compared  with  the  number  of 
units  needed  for  elements  e^.  Due  to  the  pipelining  process,  the 
parallel  operations  and  the  natural  delays  of  the  coefficients,  the 
E  matrix  elements  are  computed  so  that  elements  with  the  same  second 
subscript  are  produced  in  parallel  (see  Figure  9.5)  and  7  cycles  after 
the  previous  set.  Thus,  once  again  we  have  achieved  the  pipelining 
which  is  important  for  high  speed  processing. 

For the evaluation of Eq. (9.32), we also need to calculate the P matrix. One can easily prove that each row vector is given by

$$p_i = \lambda_i\, w_i^* \qquad (9.36)$$

where

$$\lambda_i = \langle w_i\, w_i \rangle^{-1} \qquad (9.37)$$

Eq. (9.37) has already been calculated in the processor of Figure 9.4 because it is needed for the calculation of the α, β, ..., ε coefficients. Since the $w_i$ appear 4 cycles ahead of the $\lambda_i$, we need to delay them by 4 cycles and subsequently multiply them by $\lambda_i$. This is shown in Figure 9.6 for all 6 row vectors of the P matrix. We are thus capable of producing all of the elements of a row vector of the P matrix every 7 clock cycles.

The final step in calculating Eq. (9.32) involves a matrix-matrix multiplication, i.e., EP. To achieve this we can use the array processor of Figure 9.7. This system consists of 36 similar units arranged in a square format. Each unit consists of a multiplier and an adder LUT as well as a delay. The LUTs are arranged so that each product is added to the previous one (i.e., we form a multiplier/accumulator). Each set of column units is driven in parallel by the appropriate E data, and each set of row units is driven in parallel by the appropriate P data. Upon summation of 6 products, the adders are read out, and each output is an element of the C matrix. Note that the sequential format required for both E and P data is the same as the sequential format of the data that leave the processors of Figures 9.5 and 9.6. We must multiplex the data, however, because the E and P data come from 21 and 36 output lines, respectively (see Figures 9.5 and 9.6), whereas there are only 6 input lines (per side) for the processor of Figure 9.7. This is not difficult since successive rows (columns) of the E(P) data appear every 7 cycles. Furthermore, by use of appropriate delays, we can provide the successive data at one cycle intervals (instead of 7) and allow the array processor to maintain the pipelining for at least 7 cycles, which means that a new matrix can be read out every 7 cycles.

Figure 9.7 Array Processor for Calculating A
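The data flow of Figures 9.4 through 9.7 (orthogonal columns, the triangular E, the scaled rows P, and finally the product EP) can be imitated in software. The Python sketch below uses exact rational arithmetic in place of residue LUTs, so it shows only the algebra of Eq. (9.32), not the hardware timing. The function name and the 3x3 test matrix are hypothetical.

```python
from fractions import Fraction


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def invert_by_gram_schmidt(A):
    """Return C = E P, where A E = Q and the rows of P are w_i / <w_i, w_i>."""
    n = len(A)
    u = [[Fraction(A[r][c]) for r in range(n)] for c in range(n)]   # columns of A
    w, e = [], []                                                   # columns of Q and E
    for k in range(n):
        wk = u[k][:]
        ek = [Fraction(int(i == k)) for i in range(n)]
        for j in range(k):
            coef = dot(u[k], w[j]) / dot(w[j], w[j])
            wk = [a - coef * b for a, b in zip(wk, w[j])]
            ek = [a - coef * b for a, b in zip(ek, e[j])]
        w.append(wk)
        e.append(ek)
    # Rows of P: p_i = lambda_i * w_i with lambda_i = 1 / <w_i, w_i>.
    P = [[x / dot(wi, wi) for x in wi] for wi in w]
    # C = E P, using E[i][k] = e[k][i] because e holds the columns of E.
    return [[sum(e[k][i] * P[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]   # hypothetical non-singular test matrix
C = invert_by_gram_schmidt(A)
check = [[sum(A[i][k] * C[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
print(check == [[1, 0, 0], [0, 1, 0], [0, 0, 1]])   # True
```

Each entry of `C` is a sum of n products, which matches the multiplier/accumulator units of the array processor.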


9.5  Processor Characteristics


Let us now discuss the system delay $T_s$, which is defined as the time
required to provide the matrix C after loading matrix A. From
Figure 9.4 we see that the calculation of the 6g coefficients is the
most time-consuming operation. It requires a total of

    $t_1 = (n-1) \times (t_{d4} + t_c)$    (9.38)

where n is the dimension of matrix A, $t_{d4}$ is the delay of LUTS4,
and $t_c$ is equal to the duration of a clock cycle. The next delay
comes from the processor of Figure 9.5 and is proportional to

    $t_2 = t_c + t_{d4}$    (9.39)

We now take into account the delay necessary for interfacing the systems
of Figures 9.4 and 9.5. A simple analysis shows that this delay is
proportional to

    $t_3 = (n-1) \times t_{d4}$    (9.40)

Finally, the total delay of the array processor of Figure 9.7 is of the
order of

    $t_4 = (n+2) \times t_c$    (9.41)


From Eqs. (9.38)-(9.41), we find that the total delay of the system can
be approximated by

    $T_s = 2n (t_{d4} + t_c)$    (9.42)

To express $T_s$ as a function of $t_c$ we need to calculate the delay
$t_{d4}$. Inspection of Figure 9.2 reveals that the total delay is a
function of n. A simple analysis shows that the total delay is
proportional to

    $t_{d4} \approx (L + 3) \times t_c$    (9.43)

where L is an integer which satisfies $2^L \le n < 2^{L+1}$. From
Eqs. (9.42) and (9.43), we find that the system delay is proportional to

    $T_s \approx 2n (L + 4) t_c$    (9.44)

i.e., for a fixed L the system delay is a linear function of n. Thus for
our 6x6 example, the total time required to invert the first matrix is

    $T_s = 72 t_c$    (9.45)

Assuming a clock cycle of the order of 4 nsec, we find that $T_s$ is of
the order of 0.4 μsec. Similarly, for a 12x12 example, $T_s$ is of the
order of 0.9 μsec.
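As a quick numerical check of Eqs. (9.43)-(9.45), the delay can be evaluated directly. This is a minimal sketch; the helper names are hypothetical, and the 4 nsec clock period is the value assumed in the text.

```python
def log2_floor(n):
    """Return the integer L that satisfies 2**L <= n < 2**(L + 1)."""
    return n.bit_length() - 1


def system_delay_cycles(n):
    """System delay of Eq. (9.44) in clock cycles: 2n(L + 4)."""
    return 2 * n * (log2_floor(n) + 4)


CLOCK_NS = 4  # clock period in nsec, as assumed in the text

for n in (6, 12):
    cycles = system_delay_cycles(n)
    print(f"n = {n}: {cycles} cycles, {cycles * CLOCK_NS} nsec")
```

For n = 6 this gives L = 2 and the 72 clock cycles of Eq. (9.45). With a 4 nsec clock the two cases evaluate to a few tenths of a microsecond, the same order of magnitude as the figures quoted above.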

So  far  we  have  considered  the  total  delay  of  the  system,  i.e.,  the  time 
to  invert  the  first  matrix.  Given  the  pipelined  process,  however,  the 
second  matrix  will  be  inverted  after  a  time  Tq  which  is  proportional  to 


This is equal to the time required for the formation of the
matrix-matrix multiplication performed by the system of Figure 9.7. In
this case and for a 12 x 12 system, each matrix inversion takes about
48 nsec. Note that this dictates the time delay necessary for loading
consecutive matrices into the system of Figure 9.4, which is equal to

    $T_1 = n t_c$    (9.47)

We now estimate the total number of LUTs required by the processors
of Figures 9.4-9.7. From Figure 9.4 we see that the total number of
LUTS1 (or LUTS3) needed is n, whereas the total number of LUTS2 (or
LUTS4) is n(n-1)/2. The number of LUTs in each LUTS1 (or LUTS3) is
about n. Similarly, the number of LUTs in each LUTS2 (or LUTS4) is
about 3n. Finally, we need about n(n+3)/2 adder LUTs. Thus, the total
number of LUTs in the processor of Figure 9.4 is

    $N_1 = n \times n + 3n^2(n-1)/2 + n(n+3)/2 = 3n(n^2+1)/2$    (9.48)

Similarly, the total number of LUTs for the processor of Figure 9.5 is
$\approx n^3/4$, and for the processor of Figure 9.7 is $2n^2$. Thus the
total number of LUTs required per modulo is

    $N_t = N_1 + n^3/4 + 2n^2 = (7n^3 + 8n^2 + 6n)/4$    (9.49)

Note that if $N_r$ is the number of moduli used, then the total number of
LUTs needed is $N_r N_t$.
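The LUT estimates above are easy to tabulate. The following is a minimal sketch that adds up the contributions exactly as they are listed in the text and compares the totals with the closed forms; the function names are hypothetical, and exact fractions are used because the $n^3/4$ term is not an integer for every n.

```python
from fractions import Fraction


def luts_figure_9_4(n):
    """LUT count for the processor of Figure 9.4, term by term."""
    luts1_or_3 = n * n                      # n blocks of about n LUTs each
    luts2_or_4 = 3 * n * n * (n - 1) // 2   # n(n-1)/2 blocks of about 3n LUTs each
    adders = n * (n + 3) // 2               # adder LUTs
    return luts1_or_3 + luts2_or_4 + adders


def luts_per_modulus(n):
    """Per-modulus total: Figure 9.4 plus Figure 9.5 (n^3/4) plus Figure 9.7 (2n^2)."""
    return luts_figure_9_4(n) + Fraction(n**3, 4) + 2 * n * n


for n in (6, 12):
    n1 = luts_figure_9_4(n)
    nt = luts_per_modulus(n)
    # Closed forms obtained by collecting terms in the sums above.
    assert n1 == 3 * n * (n * n + 1) // 2
    assert nt == Fraction(7 * n**3 + 8 * n**2 + 6 * n, 4)
    print(f"n = {n}: N_1 = {n1}, N_t = {nt}")
```

The leading term of the per-modulus count is $7n^3/4$, which is the source of the roughly cubic growth in hardware with matrix dimension.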

For our 6x6 example and with 8-bit input accuracy, we find from
Eq. (9.5) that M must bound $1.4 \times 10^{13}$. To handle this value we
use 11 moduli: 7, 9, 11, 13, 17, 19, 23, 25, 29, 31 and 37, and in this
case the total number of LUTs becomes $\approx$ 5,000. Note that these
results reflect the fact that we chose to use a high degree of parallel
processing which results in a LUT number requirement that is
proportional to $n^3$. This requirement can be reduced considerably if we
choose to use more of a serial-type processor; this, however, will
reduce the speed of the processor. Such issues require trade-off
analyses in order to show clearly the optimum system architecture once
the convergence time requirement is defined.
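The choice of moduli can be checked numerically. This is a minimal sketch, assuming that the requirement on M is simply that the product of the moduli exceed the quoted bound of 1.4 x 10^13; the variable and function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import gcd, prod

MODULI = [7, 9, 11, 13, 17, 19, 23, 25, 29, 31, 37]
REQUIRED_RANGE = 1.4e13  # bound on M quoted in the text

# A residue number system needs pairwise relatively prime moduli.
pairwise_coprime = all(gcd(a, b) == 1 for a, b in combinations(MODULI, 2))

# The dynamic range M is the product of the moduli.
dynamic_range = prod(MODULI)


def luts_per_modulus(n):
    """Sum of the LUT counts quoted for Figures 9.4, 9.5 and 9.7."""
    n = Fraction(n)
    return 3 * n * (n**2 + 1) / 2 + n**3 / 4 + 2 * n**2


total_luts = len(MODULI) * luts_per_modulus(6)

print("pairwise coprime:", pairwise_coprime)
print("dynamic range M: ", dynamic_range)
print("range sufficient:", dynamic_range > REQUIRED_RANGE)
print("total LUTs:      ", total_luts)
```

The product of all 11 moduli is about 5.6 x 10^13, whereas dropping the largest modulus leaves about 1.5 x 10^12, which is below the bound; this is consistent with 11 moduli being needed. The LUT total evaluates to 5049 for n = 6, in line with the figure of about 5,000.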


10.  CONCLUSIONS  AND  RECOMMENDATIONS 


In this program we have examined the possibility of using DMAC-based AO
processors for solving eigensystems in conjunction with the APAR
problem. Study of existing eigensystem solution algorithms has revealed
that many of the required logical and arithmetic operations cannot be
provided by the AO processors. An analysis of various classes of AO
processors that are based on DMAC and its parallel extension BPAM has
clearly shown that this type of AO system offers no advantage over
existing all-electronic systems. Therefore, we do not consider this to
be a viable approach.

We  have  suggested  that  optical  interconnections  will  allow  electronic 
digital  multipliers,  in  square  array  formats,  to  be  globally 
interconnected. At high processing speeds (≥ 500 MHz) optical
interconnections seem to be the only choice. These, in conjunction with
global communications, will enhance the processing speed. We have
suggested a simple but efficient fiber-optic technique that allows for
global  interconnections  and  we  have  fabricated  a  prototype  optically 
addressed  digital  multiplier.  Much  work  is  needed  in  this  area.  We 
suggest  that  further  analyses  be  carried  out  of  an  optically 
interconnected  square  array  for  matrix-matrix  multiplication  and  that  a 
prototype  array  be  built  and  evaluated. 

Residue-based  LUT  processing  has  been  considered.  We  have  proposed  a 
laser  diode-based  LUT  which  can  be  fabricated  with  present  technology 
and  have  fabricated  and  tested  a  modulo  7  prototype  LUT.  The  results 
suggest  that  when  monolithic  LUTs  are  developed,  switching  speeds  that 
exceed  1  GHz  should  be  easily  achievable.  We  suggest  that  LUT  modelling 
and  analysis  take  place  so  that  the  problem  of  pulse  reflections  can  be 
minimized. The number of laser diodes in the fabricated LUT grows as
the square of the modulo. In order to avoid an excessive number of
laser diodes, novel LUT architectures need to be developed and we
recommend that additional research be done in this area. We have shown
that  B/R  and  R/B  conversions  can  be  efficiently  implemented  via  the  use 
of  LUTs.  An  example  of  LUT  array  processing  has  shown  that  these 
conversions  require  about  20%  of  the  total  hardware.  This  suggests  that 
the  longer  the  processing  in  the  RNS  the  less  the  relative  hardware 
needed  for  conversions.  For  the  above  example  we  have  also  shown  that 
the  total  number  of  gates  for  the  RNS  LUT  processing  is  about  twice  that 
of  the  electronic  counterpart.  This  suggests  that  comparison  analyses 
be  done  in  order  to  identify  both  the  competitiveness  of  LUT  processing 
as  well  as  the  applications  for  which  RNS  LUT  processing  is  well  suited. 
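The LUT-based residue operations summarized above can be illustrated in a few lines. This is a minimal software sketch, not the laser diode implementation: it builds an m x m table for a two-operand operation (which is why the table size grows as the square of the modulo), converts binary to residues by reduction, and converts back with the Chinese remainder theorem. The function names and the three-modulus example are assumptions chosen for brevity.

```python
from math import prod


def build_lut(m, op):
    """Return the m x m lookup table of op(a, b) mod m."""
    return [[op(a, b) % m for b in range(m)] for a in range(m)]


def to_residues(x, moduli):
    """B/R conversion: reduce x modulo each modulus."""
    return [x % m for m in moduli]


def from_residues(residues, moduli):
    """R/B conversion by the Chinese remainder theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the inverse mod m
    return total % M


moduli = [7, 9, 11]                      # small illustrative set, M = 693
mul_luts = [build_lut(m, lambda a, b: a * b) for m in moduli]

x, y = 100, 5                            # x * y = 500 lies within the range M
rx, ry = to_residues(x, moduli), to_residues(y, moduli)
rz = [lut[a][b] for lut, a, b in zip(mul_luts, rx, ry)]
print(from_residues(rz, moduli))
```

Each residue channel is processed independently, so the only steps that involve all moduli at once are the two conversions; the more operations performed between them, the smaller their relative cost.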

The RNS LUT processing has also been studied for use in the APAR area.
We have found that the only algorithm suited for residue LUT
implementation is a variant of the Gram-Schmidt orthogonalization
procedure. We have shown, through examples, that such an approach
yields the correct results. We recommend that this approach be further
analyzed in order to determine its exact requirements and shortcomings.
Finally, we have presented the complete design of a pipelined RNS LUT
processor for the inversion of a 6x6 APAR data matrix. We have shown
that with a fully parallel implementation, the matrix inversion takes
place in N+1 cycles. Note that for such an implementation, the total
number of LUTs grows as ~ $2N^3$. Thus, in order to avoid an excessive
number of LUTs we recommend that a similar design be made with a more
serial nature. Such a design should clearly show the trade-offs between
processing speed and hardware complexity.

APPENDIX A

AN  ANALYSIS  FOR  CIRCULARLY  POLARIZED  SAMPLING 


This appendix describes the mathematical validity of the circularly
polarized sampling by observing the sampled function in the frequency
domain.

Consider a complex function, f(t), that has a real and an imaginary part:

    $f(t) = f_R(t) + j f_I(t)$    (A-1)

The sampling function is a series of delta functions, the phase of which is
shifted by 90° (Figure A.1) or the complex amplitude part forms a series
(1, j, -1, -j, 1, j, ...). In the first sampling period, it samples the
real part of the input function; in the second sampling period, it samples
the imaginary part; in the third period, it samples the real part with
negative polarity; and in the fourth period, it samples the imaginary part
with negative polarity. This four-cycle pattern is repeated for the
remainder of the sampling operation. We call this function a Right-Hand
Circularly Polarized (RCP) sampling function, indicating the rotational
orientation of the phasor. Similarly, a Left-Hand Circularly Polarized
(LCP) sampling function is a delta function series with the quadrature
rotating in the opposite direction (1, -j, -1, j, ...) and the sampling
operation can be performed in a similar way.
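The four-cycle pattern described above can be reproduced with a short discrete-time sketch. It assumes that each sample is the real part of the input multiplied by the complex conjugate of the phasor; that sign convention is an assumption chosen so that the RCP sequence yields the order stated in the text (real, imaginary, negative real, negative imaginary). The names are hypothetical.

```python
# Phasor sequences of the two sampling functions (one entry per sampling period).
RCP_PHASORS = (1 + 0j, 1j, -1 + 0j, -1j)
LCP_PHASORS = (1 + 0j, -1j, -1 + 0j, 1j)


def circular_sample(values, phasors):
    """Sample a complex sequence with a four-step rotating phasor.

    Assumed convention: sample k is Re(values[k] * conj(phasors[k mod 4])).
    """
    return [(v * phasors[k % 4].conjugate()).real for k, v in enumerate(values)]


# Constant input f = 3 + 4j, so f_R = 3 and f_I = 4 in every period.
f = [3 + 4j] * 4
print(circular_sample(f, RCP_PHASORS))  # f_R, f_I, -f_R, -f_I
print(circular_sample(f, LCP_PHASORS))  # f_R, -f_I, -f_R, f_I
```

The two outputs differ only in the sign of the imaginary-part samples, which is the discrete counterpart of the opposite sense of rotation of the two sampling functions.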

Mathematically  this  operation  can  be  described  as  follows: 

Let $s_{RCP}(t)$ and $s_{LCP}(t)$ be the RCP and LCP sampling functions,
respectively, i.e.,

Figure A.1 Circularly polarized sampling function.

The sampled versions of the function, $f_{RCP}(t)$ and $f_{LCP}(t)$, are
obtained by evaluating the real part of the product of f(t) and the
sampling functions

Therefore $f_{RCP}(t) = f^*_{LCP}(t)$, i.e., the RCP sampled signal and the
LCP sampled signal are conjugate to each other.

The  spectra  of  these  signals  can  be  obtained  by  Fourier  transformation 
of  the  expressions: 


after dropping unnecessary coefficients (see Figure A.2). These
equations indicate that the frequency domain contains the replications
of the original spectra as in the case of ordinary sampling. The
differences are that the primary spectra are located at 1/2t and that
the adjacent aliases are conjugate to each other. The spacing between
the alias centers is 1/t. Therefore, the bandwidth B of the original
function must be smaller than the sampling frequency, i.e.,